Measurement in Music


SAMUEL T. BURNS, Professor of Public School Music,
Indiana University 


MEASUREMENTS in music fall into two large groupings: the measure- 
ment of musical talent and the measurement of musical achievement. 
Measurement of musical talent aims to discover the subject’s natural 
musical endowment, with considerations of training ruled out; measure- 
ment of accomplishment aims to measure to what extent the subject has 
changed by contact with the musical environment or stimuli to which 
he has been subjected. 

To these two general fields of measurement of music, a third is some- 
times added, the measurement of music appreciation. This term is so 
loosely used as to make some definition necessary. The term “music 
appreciation” is sometimes used to indicate knowledge about music: 
names of composers, compositions, nationalities, historical and technical 
facts, etc. A music appreciation test of this type is surely but another 
type of achievement test, measuring the extent to which the subject has 
acquired knowledge about music. The term “music appreciation” is some- 
times used to indicate an individual’s like or dislike of various kinds of 
music, or to indicate his reactions to music. Used in this sense, music 
appreciation is probably the result of both natural endowment and train-
ing, compounded in various degrees. Scrutiny of tests of music appre-
ciation shows that they, too, fall into the two general classes already
mentioned: either they aim to measure the subject’s capacity to react to 
music, or they measure his acquired information about music. 

With this twofold aspect of musical measurement in mind, we may 
next inquire as to the values of measurement in music. Why should we 
be concerned with trying to find out the extent and quality of native 
music talent? What use can be made of information regarding an indi- 
vidual’s accomplishments in music? 

In answer to the first question, justification for giving attention to 
talent testing may be found both from the standpoint of the individual 
and from the standpoint of society. If any large degree of success or 
failure in the pursuit of music depends on the possession of natural 
musical talent, then it is the duty of educators to attempt to discover 
what constitutes such talent and to devise instruments for its measure- 
ment. The young person considering music as his life’s work will be 
spared the waste of time, money, and effort and ultimate defeat if he 
can know in advance whether or not he possesses the basic abilities nec- 
essary for success in the field of music. 

No less cogent reasons exist for measurement of musical talent from 
the standpoint of society. Music education is expensive education, for 
much of it has to be on an individual basis. Group techniques for the 
development of first-class performers at the higher levels in music have 
not yet been developed. At the present time, and probably for some time 
to come, much music teaching must be done by the individual teacher 
guiding the efforts of the individual pupil. Such instruction is relatively
costly. If there are fundamental native abilities essential to success in 
music, then common sense demands that we attempt to discover evi- 
dence of these abilities before society expends any large sums for the 
music instruction of any individual. 

In the field of accomplishment measurement, the same values exist 
as exist for any other subject. Definite knowledge regarding accom- 
plishment in music is of value in measuring the results of teaching, in 
determining norms of possible accomplishment, in classifying and sec- 
tioning for teaching purposes. 

In the time available for the discussion of measurement in music, it 
is obvious that we cannot go into the subject in any great detail. We 
cannot attempt to consider the three or four score musical measurements 
already on the market. Those who are interested should refer to A De-
scriptive Bibliography of Prognostic and Achievement Tests in Music,
published by the Bureau of Publications, Teachers College, Columbia 
University, New York. 

I shall attempt to give a brief résumé of the essential practical 
considerations with regard to measurement in music. To what extent 
may we safely use available musical talent tests for the purpose of 
deciding what students shall be accorded the privileges of musical study? 
What pitfalls must be avoided in the use of accomplishment tests in the 
field of music? 

Discussion of the first question necessitates a brief consideration 
of the fundamentals of talent testing. Such consideration begins with 
the work of Carl E. Seashore, of the State University of Iowa, who has 
done extensive pioneer work in this field and is continuing researches in 
the field of the psychology of music. Seashore assumes that musical 
talent is not a unit condition possessed or not possessed by any individual. 
Seashore contends that musical talent is a complex of abilities and 
capacities which, when applied to musical media, result in music and 
musicianship. Seashore considers musical talent as a “hierarchy of 
talents,” any one of which is possessed by every individual to a greater 
or lesser degree. 

Proceeding on this assumption, Seashore has analyzed musical talent 
into five large groupings: 

I. Musical Sensitivity (sense of pitch, of intensity, of time, of
   extensity, of rhythm, of timbre, of consonance, and of volume)

II. Musical Action (natural capacity for skill in accurate and mu-
   sically expressive production of tones in control of pitch, inten-
   sity, time, rhythm, timbre, and volume)

III. Musical Memory and Imagination (auditory imagery, motor
   imagery, memory span, learning power)

IV. Musical Intellect (musical free association, musical power of
   reflection, general intelligence)

V. Musical Feeling (musical taste, emotional reaction to music,
   emotional self-expression in music)

On the basis of this analysis, Seashore has constructed a battery
of tests aiming to measure these elemental abilities. Best known of these 
tests are the musical sensitivity and the music memory tests. These tests 
have been recorded, and are thus available to anyone with a phonograph 
and the necessary records. The recorded tests make it possible to dis- 
cover an individual’s ability to distinguish fine differences in the pitch of 
tones; fine differences of loudness and softness; fine differences between 
two time intervals; the extent of the musical memory span; perception 
of relative consonance and dissonance; and perception of rhythm. These 
tests have been widely administered; they have been subjected to con- 
siderable scrutiny in regard to their reliability, their validity, and their 
value as prognostic instruments in the selection of students for music 
study. 

It is usually forgotten, however, that they do not cover the entire 
field of musical talent as outlined by Professor Seashore. Of the five 
large areas suggested by Seashore as comprising musical talent, these 
recorded tests touch upon only two: musical sensitivity and musical 
memory. They do not assume to measure musical action, musical intel- 
lect, or musical feeling. Tests for the latter three phases of musical 
talent mentioned have been designed by Seashore, but they are tests 
which demand laboratory equipment and trained administrators—factors 
which put them beyond the reach of most public school workers. 

The false assumption on the part of many experimenters that these 
recorded tests, used alone, reveal the presence or absence of necessary 
musical talent has led to many unfortunate results. The tests have been 
given to numerous groups of prospective students, and the success or 
failure of these students has been predicted on the basis of the tests. 
The accomplishment of these students as revealed by various accomplish- 
ment criteria at the end of a period of study has then been correlated 
with the predictions. In all cases that have come to my attention, the 
correlations between the prediction and the end accomplishments have 
been low, ranging from a correlation of -.15 in one study to a maximum 
of .73 in another, with the median about .22. 
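
For readers who wish to see how such a figure is obtained, the following is a minimal sketch of the correlation computation. It assumes the ordinary product-moment coefficient, which the studies cited above may or may not have used. The function name and the two short lists of scores are hypothetical, invented solely for illustration, and are not data from any of the studies mentioned.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a talent-test score for each pupil at the outset, and
# a teacher's rating of the same pupil at the end of the period of study.
test_scores = [42, 55, 61, 48, 70, 66, 53, 59]
end_ratings = [3, 2, 4, 4, 3, 5, 2, 3]

r = pearson_r(test_scores, end_ratings)
print(f"correlation between prediction and accomplishment: r = {r:.2f}")
```

A coefficient near 1 would mean that the test scores rank the pupils almost exactly as their later accomplishment does; a coefficient near 0 would mean that the scores tell very little about that ranking.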

The basal error in such experiments is in the assumption that these 
recorded tests measure all of the factors that enter into the making of 
success or failure in music. As pointed out above, the recorded tests 
touch upon only two of the five areas into which Seashore has analyzed 
musical talent. There is no assurance that ability to discriminate fine 
differences in pitch as heard in these tests carries with it the motor 
ability to duplicate such fine differences in the playing of a violin; that 
the ability to distinguish fine differences in time intervals between two 
clicks, as given on the record, means that the subject has the motor 
ability to recreate such fine differences in his own performance on a 
musical instrument. 

Even if we accept Seashore’s hypothesis that musical talent is a 
compound or hierarchy of separate abilities which can be isolated and 
measured separately, we are still unjustified in making any dogmatic 
statements regarding the presence or absence of musical talent on the 
basis of these recorded tests alone, for they cover too small a part of 
the entire talent complex. Seashore himself decries any such use of his 
test material. In a recent article he states, “These tests should not be
validated in terms of their showing on an omnibus theory or blanket 
rating against all musical behavior, including such . . . situations as com- 
position, directing, voice, piano, violin, saxophone, theory, administration 
or drums, because there are hundreds of other factors which help to 
determine job analysis in each of such fields. . . . I have insisted that even 
the most superficial rating for selection or placement in musical training 
or adjustment should be based upon a careful case history and a reliable 
audition with the profile of measurements in hand.” 

Confidence in the usefulness of the tests as the sole means for the 
selection of students for music study is further weakened by the doubt 
expressed in some quarters as to the soundness of the hypothesis upon 
which Seashore has built his theory of musical talent and constructed 
his tests. This questioning of the hypothesis is stated effectively by 
Dr. James L. Mursell in a recent article, “The Issues of the Test Dis- 
cussion.” In answer to the question “Is there such a thing as general 
musical talent?” Mursell says, “Seashore believes not. . . . He holds that 
musicality is not one single factor in the human mental make-up, but 
consists of a large number of specific and limited traits, of which the 
tests measure six. This is one representative view of the nature of 
human abilities. But the reader should know that it is far from being 
universal among competent psychologists. For myself I am unable to 
accept it. While it is clear that we must not think of musical talent as 
a sort of faculty, yet one may not unreasonably believe that all musical 
people . . . have something in common. This we would call their musi- 
cality or their talent and it might well consist of certain excellencies of 
hearing, innate and acquired, which the Seashore tests, dealing on the 
whole with sensory abilities, are not even designed to measure. We 
have no certain knowledge on such a point, but this position may be 
counted at least respectable psychologically. . . . The way is open, 
hopefully open, for research on the construction of tests different in 
principle from the Seashore tests. Such efforts are being widely made. 
What we have are really two working hypotheses for the direction of 
research.” 

In the face of these uncertainties as to the efficacy of the Sea- 
shore tests in measuring musical talent, what attitudes toward talent 
testing can the music teacher or administrator take in regard to the use 
of talent tests within the schools? It is evident that we cannot safely 
adopt any policy of excluding students from the opportunity of music 
study on the basis of these tests alone. Other factors that make for 
success or failure and are not measured by these tests operate to too 
great an extent. Advising any student against attempting music study 
solely because of low rating in one or more of these tests is apt to bring 
about injustice in large numbers of cases. 

An illustration drawn from my experience will be of value here. 
Several years ago I administered the Seashore tests and some others 
of my own construction to students in the beginning instrumental classes 
of Medina County, Ohio, where I was at that time serving as county 
director of music. On the basis of the tests we predicted success or 
failure on the part of the students. We kept no students out of the
classes on the basis of the tests; everyone was admitted and permitted 
to receive the instruction. At the end of the year we classified the stu- 
dents into successes or failures on the basis of accomplishment in the 
classes as evidenced by the teachers’ ratings and on the basis of a 
performance examination given to all students in the county by the 
county director. Of those whom the tests predicted to fail, a third
succeeded in learning to play their instruments acceptably well;
of those predicted successes, almost the same percentage failed. I re- 
peated the experiment for three successive years, refining the techniques 
of test administration and the statistical procedure each time. The re- 
sults were progressively worse; the more careful we were in our tests 
and computations, the lower was the correlation between the predicted 
results and the end results. 
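
The tabulation described above can be sketched as follows. This is only an illustration: the function name, the class of ninety pupils, and the counts are all hypothetical, chosen so that the proportions resemble those reported in the text. They are not the Medina County records.

```python
def outcome_rates(predicted, actual):
    """Tally predictions against end results.

    predicted, actual: equal-length lists of booleans (True = success).
    Returns the share of predicted failures who succeeded and the share
    of predicted successes who failed (None if a group is empty).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {(p, a): 0 for p in (True, False) for a in (True, False)}
    for p, a in zip(predicted, actual):
        counts[(p, a)] += 1
    n_pred_fail = counts[(False, True)] + counts[(False, False)]
    n_pred_succ = counts[(True, True)] + counts[(True, False)]
    return {
        "predicted_failures_who_succeeded":
            counts[(False, True)] / n_pred_fail if n_pred_fail else None,
        "predicted_successes_who_failed":
            counts[(True, False)] / n_pred_succ if n_pred_succ else None,
    }


# Hypothetical class of 90 pupils: 60 predicted to succeed, 30 to fail.
predicted = [True] * 60 + [False] * 30
actual = [True] * 42 + [False] * 18 + [True] * 10 + [False] * 20

rates = outcome_rates(predicted, actual)
for label, share in rates.items():
    print(f"{label}: {share:.0%}")
```

Reading the two shares side by side makes plain how many pupils a policy of exclusion would have wronged, a point that a single correlation coefficient tends to conceal.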

These experiments convinced me that we should never be guilty of 
discouraging any pupil at the beginning stages from undertaking the 
study of music on the basis of any talent tests. In my opinion the thirty 
or more per cent of predicted failures who succeeded were justified in 
claiming a chance to prove what they could do. 

Yet the fact that seventy per cent of the predicted successes did 
succeed, and that seventy per cent of the predicted failures did fail, 
suggests that one might safely use the test results in a positive way to 
encourage students who rate high in the tests. The tests, although not 
measuring the whole of musical talent, apparently indicate something 
which makes for success in a majority of cases. The tests can probably 
be safely used as a means of discovering talent and as a basis for 
encouraging study. Seashore himself suggests that practical use of the
tests when he says, “There is a positive use . . . in that a relatively
good profile [in the tests] may lead to case history, further measure- 
ment and auditions for the purpose of discovering and encouraging 
talent.” 

As the basis of this discussion I have used the Seashore tests be-
cause they are best known and have been subjected to more extensive
experiments than any other talent tests. There are many other talent 
tests, most of them based on the same assumptions as the Seashore bat- 
tery. For all of them the conclusions reached in the discussion of the 
Seashore tests would hold: that we know too little about the whole sub- 
ject to base any negative conclusion in a specific case upon the test 
results alone before a trial has been given. Honesty and justice demand 
that we tell any student who rates low in the tests that, in spite of the 
low rating, he may succeed in music study, for the test does not tell the 
whole story. 

Before leaving this subject of talent testing I must mention another 
practical phase of the subject, and that is the use of so-called talent 
tests for promotional purposes by commercial firms. I have in my files 
a letter from a manufacturer of musical instruments which makes such 
modest claims as the following: “Do you need more students in your 
band? Are you interested in finding your talented students, the ones 
who are sure to succeed and become musical leaders? Do you want to 
quit wasting time on beginners who never can learn? We can solve all
these problems for you.” 

Following these claims is a description of a talent test designed 
presumably to measure natural musical aptitude. 

The test consists of between forty and fifty items purporting to 
measure tonal sensitivity, rhythmic sensitivity, harmonic sense, and 
appreciation of music. The items are variously weighted so that the final 
score is stated in percentage, the perfect score being a hundred. Informa- 
tion is not given as to the basis of selection of the items used. Nothing 
is said regarding their validity, their reliability, or the reasons why one 
item is weighted 4 and another item, seemingly just as good, is weighted 2. 
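
Why unexplained weights matter can be shown with a small sketch. The manufacturer's letter does not say how the weights are combined, so the scoring rule below (weight earned divided by total weight, times one hundred) is an assumption. The five items, the answers, and both sets of weights are hypothetical.

```python
def weighted_percent(correct, weights):
    """Weighted percentage score; a paper with every item right scores 100."""
    if len(correct) != len(weights):
        raise ValueError("one weight is required for each item")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(w for ok, w in zip(correct, weights) if ok)
    return 100.0 * earned / total


# The same hypothetical paper, scored under two arbitrary weightings.
answers = [True, False, True, True, False]
weights_a = [4, 2, 2, 4, 2]
weights_b = [2, 4, 4, 2, 2]

score_a = weighted_percent(answers, weights_a)
score_b = weighted_percent(answers, weights_b)
print(f"weighting A: {score_a:.1f}   weighting B: {score_b:.1f}")
```

The identical set of answers earns noticeably different percentages under the two weightings, which is why a test that gives no reason for its weights offers no ground for trusting its final score.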

Making such claims for such tests is educational charlatanism, and 
no conscientious school man, knowing the facts, can allow such fraud to 
be perpetrated on his unsuspecting students and their parents. One 
dealer to whom I pointed out these deficiencies in his tests decided that 
rather than run the risk of setting up a false idea of musical ability in 
any student he would give all students passing grades in his future 
sales campaigns. So far as I know he is still operating on this policy. 
He gives grades on the tests as before, but no matter how low the grade, 
he gives every child a certificate telling him he has musical talent and 
ought to study an instrument. Such procedure reduces the whole testing 
scheme to the level of a farce. But this dealer is at least not making 
children believe they lack musical talent on the grounds of scores on an 
unreliable instrument. When profound scholars, who have devoted years 
of intense study to the subject, refuse to countenance any sweeping claims 
for their tests as means of sifting out the musically talented from the 
untalented, surely we are committing a grave error when we permit the 
sales representative of a commercial firm to make such claims for an 
unreliable, unvalidated instrument, constructed how or by whom no 
one knows. 

Time will not permit of any extended discussion of the field of ac- 
complishment testing. Many tests aiming to measure many different fields 
of music accomplishment are in existence. Most of them deal with musi- 
cal information and facts that are definite, concrete, and measurable. 
Many of them have been validated by reference to specific courses of 
study and contain a sufficient number of items to be highly reliable. 

Care must be exercised in their use to make sure that they measure 
the significant things of the music program. Any person with normal 
intelligence can learn the facts about composers and their compositions, 
can learn key signatures, note names, time values, the meaning of clefs, 
bars, rests, etc. But are these easily measurable facts the important 
things in music study? I have known classes whose members would rate 
high in all such factual information, who knew few songs, who sang 
poorly and without enthusiasm, who had an antagonistic attitude toward 
music, who, although they knew the names of composers and composi- 
tions, did not recognize the compositions when they heard them, and who 
preferred music of a much lower grade of excellence. I know a school 
system where the monthly grade in music on the report card is based on 
a written test on theoretical facts. The children of that school system 
sing in a very mediocre fashion and know few songs. I asked the super-
visor why so little music material and actual musical experience were
given to the classes. Her reply was that it took so much time to learn 
the theoretical material for the written examinations that there was 
not much time left to learn the music. 

The great contribution that music makes to life is that it is a 
source of joy and happiness; it is a means of emotional stimulation and 
expression; it is an avenue whereby one can enter intimately into the 
emotional life of other individuals, other peoples, other cultures. These 
great values of music study and music experience cannot be measured by 
any form of written test and care must be taken that, in any program of 
accomplishment testing in music, we are not laying so much emphasis 
on the incidentals that the essentials are neglected or pushed into sec- 
ond place. 

Knowledge of theoretical facts about music is justified as a tool for 
making music more serviceable, as a means of unlocking musical experi- 
ences more easily. And in so far as we teach theoretical facts we are 
justified in attempting to measure their acquisition. But we must always 
remember that they are a means, not an end; that the end of music teach- 
ing is love for music, delight in its creation, happiness in its presence, 
reaction to its movement; and that these ends cannot be scientifically 
measured. 


Present Tendencies in Educational
Measurement 


CHARLES W. ODELL, Associate Professor of Education, 
University of Illinois 


DURING the third quarter of the nineteenth century Sir Francis
Galton in Britain, and a few years later J. McKeen Cattell in this coun- 
try, began their now well-known studies along the lines of individual 
differences and the measurement of mental abilities. Near the end of 
the century, in 1894 to 1897, J. M. Rice carried out a project that is 
commonplace in these days, but at that time it was an almost, if not
quite, unheard-of undertaking. He gave uniform tests in spelling,
arithmetic, and language to the children in a number of school systems 
and used the results as a basis for conclusions concerning the effective- 
ness of instruction and related matters. Influenced by his knowledge of 
Galton’s work, his graduate study under Cattell, and his acquaintance 
with Rice’s results, E. L. Thorndike early acquired an interest in the 
field of mental measurement. This interest led him to write a book, 
An Introduction to the Theory of Mental and Social Measurements, gen- 
erally accorded the distinction of being the first book in this field. Only 
a year later the two French workers, Alfred Binet and Theodore Simon, 
published the first edition of their now world-famous individual in- 
telligence scale. With these latter two studies, in 1904 and 1905 re- 
spectively, the modern educational measurement movement may be said 
to have gotten under way. 

As is true with many new movements, progress was slow for a 
time. Within a few years, however, the work of such leaders as Ayres, 
Buckingham, Courtis, Goddard, Monroe, Stone, Terman, Trabue, and 
others, many of them colleagues or students of Thorndike, both popu- 
larized the movement and established a rapidly increasing body of 
content. Despite opposition, some of it determined and bitter, there was 
wide and enthusiastic reception of the movement during the period up 
to about 1920, accompanied by very rapid growth in the number of 
measuring instruments produced, the number of individuals to whom 
they were administered, and the number of books, articles, and other 
published material devoted to this phase of education. Although many 
of the leaders were critical and careful from the first, the general 
tendency was, as it frequently is under such circumstances, to accept 
the novel means provided uncritically. After this period of rapid 
expansion came one sometimes referred to as the depression in educa- 
tional measurements, a term that is too strong but yet indicates some- 
thing of its characteristics. Within the last few years, since perhaps 
1930, there has been somewhat of a revival of interest and activity in 
this field, in general on what appears to be a sounder basis than that of 
earlier years. It is with the developments of this recent attack on the 
general problem that I wish to deal today. In so doing it appears as if 
the present purposes would be an appropriate place to begin. 

The most marked characteristic of these purposes, as compared with
those of the early days of the movement, is probably their breadth. They 
are stated from so many points of view that it is extremely difficult to 
compile a single list that is sufficiently inclusive to indicate their scope. 
They recognize both a broader general field within which measurements 
may serve to advance educational progress and efficiency, and also many 
more of the smaller, more specific objectives which are involved in the 
more general ones. To begin with, we may well consider an inclusive 
statement of the one all-embracing, or almost so, purpose they should
serve. Before proceeding to do so, however, let us recall that all meas- 
uring instruments employed in connection with educational activities may 
be grouped into two main classes on the basis of their direct or indirect 
relationship to school pupils and students. This is to say, they are either 
tests, scales, questionnaires, rating cards, or other instruments applied to 
the measurement of the pupils or students themselves, or they are such 
instruments applied to such features of the educational system and its 
activities as buildings, textbooks, teachers, budget practices, provisions 
for individual differences, supervisory programs, and so forth. Practi- 
cally all of our specific consideration will be devoted to those of the 
former type. 

For this, the more important of the two classes just mentioned, one 
statement of the purpose of all testing and other measurement pro- 
cedures is that they are to enable the teacher, the counselor, the super- 
visor, the administrator, and anyone else who deals with the educa- 
tion of the child, to secure more comprehensive, objective, reliable, valid, 
and therefore more useful, measures of him, so that the curriculum, 
teaching methods, buildings and their equipment, and all other factors 
that affect his development may be optimally adapted to his individual 
needs. To fulfill this purpose we must measure and understand the 
child when he first comes to us, in the first grade, kindergarten, or even 
nursery school, and continue to do so as he matures, with especial at- 
tention to making such measurements as will afford us evidence of the 
worth of the influences that we bring to bear upon him. Only in so far 
as we can accomplish this task can we place confidence in our attempts 
to educate him. 

We may now proceed to list more detailed, but still quite general, 
objectives which our testing, or better evaluating or appraising, pro- 
gram should seek to attain. I do not wish to be understood as offer- 
ing this list as a complete one, but rather as one containing a number 
of the outstanding objectives suggestive of the scope of the undertaking. 
From the standpoint of more or less immediate dealing with the child 
we now have these purposes, among others, emphasized: 

1. The discovery of the initial status of the child in intelligence, atti- 
tudes of many kinds, health and physical condition, environmental 
background, temperamental qualities, ambitions and interests, ability 
in the school subjects, and other characteristics 

2. Measurement of his reactions to the procedures employed in his 
education 

3. Revelation to the pupil himself of so much of the information just 
referred to as will be to his profit 

4. The actual learning that results from the taking of tests of certain
types and the study of the results therefrom 

5. The fostering of an attitude, on the part of the child, of responsibility 
for his own development, with the recognition that, since the school 
exists for him and the teacher is there simply to help him in his 
development, he should endeavor to understand the activities of the 
school and promote his own education rather than expect the teacher 
or someone else to do it for him 


In view of the time at my disposal and the other points with which 
I wish to deal, there is not time to pursue the division and subdivision 
of objectives into anything approaching full detail. I do wish, however, 
to suggest that measurement should serve to (a) provide a basis for 
the placement of pupils where they will receive the most help, (b) reveal 
the specific weaknesses and needs of each child, (c) indicate how nearly 
up to his capacity each child is working, (d) make known those individual
interests and abilities in which each pupil may make his greatest con-
tribution, (e) determine the approximate goals for which each child
should strive, (f) evaluate instructional and supervisory activities, (g) 
indicate the adjustment problems of the individual, (h) give evidence 
as to the worth of textbooks and other materials which pupils employ, 
and so on almost without end if more specific statements are used. 

The most important and significant portion of the field covered by 
educational measuring instruments is that represented by the second of 
the general purposes stated above—measurement of the child’s reac- 
tions to the school’s procedures, or, in other words, of the extent to 
which he exhibits the desired outcomes of the educative process. Within 
this phase a much more comprehensive concept of what should be tested
has developed recently, and a large amount of effort has been ex- 
pended in attempting to produce instruments to measure it. No longer 
do we accept as even fairly satisfactory the measurement of memorized 
information, more or less routine and mentally mechanical skills, and 
the very simple and direct application thereof. Without discarding such 
measures as worthless, we now see them as only supplementary to the 
more significant ones which will be suggested. 

Probably the most outstanding recent work in both setting up lists 
of objectives and constructing instruments to measure them is that being 
done under the leadership of R. W. Tyler, J. W. Wrightstone, Louis 
Raths, and others in connection with the Eight-Year Study of the Pro- 
gressive Education Association. Since I shall give a rather prominent 
place to their work, it appears appropriate to describe the study briefly. 

At the instance of the Progressive Education Association some 280 
colleges and universities agreed, about five or six years ago, that for 
eight years they would waive their usual entrance requirements for the 
graduates of thirty carefully selected secondary schools, public and 
private, and accept students therefrom on their recommendations. One 
of the conditions imposed was that the secondary schools concerned 
should develop means of securing and transmitting to the higher insti- 
tutions such information about each student as would enable them to 
meet his needs. For this reason, and for the purpose of attempting to 
evaluate the outcomes of progressive curricula and methods of instruc-
tion, the Progressive Education Association, in cooperation with the
thirty schools, chose a staff under whose leadership they have been 
attacking the problem. 

After careful study, the staff and the committees working with it 
selected ten aims, the accomplishment of which it would endeavor to 
measure. These aims are: 

1. Various aspects of reflective thinking
2. Interests, aims, and purposes
3. Attitudes
4. Social adjustment
5. Creativeness
6. Study skills and work habits
7. Fund of vital information
8. Appreciation
9. Social sensitivity
10. Functional philosophy of life


Each of these was further subdivided into more detailed objectives 
which were to serve as the basis of test construction. I shall not take 
time for all, but, in order to show their nature and scope, I shall give 
those included under the first and third headings in the tentative lists. 
Those under the first, reflective thinking, follow: 


a. Interpretation of data
b. Application of facts and principles to new situations
c. Nature of proof
d. Relevancy in thinking
e. Functional thinking
f. Consistency of belief
g. Ability to generalize
h. Logical thinking
i. Scientific method
j. Proposing and testing fruitful hypotheses and eliminating un-
promising ones


Under the third heading, attitudes, were listed those toward the 
following: 
a. International affairs 
b. Democracy, individualism, and labor 
c. Religion, the family, and politics 
d. Propaganda 
e. Militarism, racialism, and nationalism 
f. Science 
g. Taxation 
h. Health regulations 
i. Proof 
j. Flexibility of outlook

Another point of departure in the broadening of objectives has
been that of the school subjects and subject fields. Within almost every
one, both elementary and high school, there have been expanded sets
of aims prepared to replace the much narrower ones over which tests 
were constructed in the early days of the movement. We are, of course, 
far from having attained the goal of having available measuring in- 
struments for all of these detailed aims, but much time and attention 
is being given to their development. 

As an example of this tendency I should like to discuss tests in 
two or three fields. When standard tests began to appear in the field 
of social science, for example, they were almost entirely concerned 
with dates, events, characters, and other matters of fact in history pre- 
sented in simple relationships with one another, with legal and other 
information in the field of civics, and with other similar outcomes. Now 
we find that such workers as Kelley and Krey and their associates have 
made available a much wider range of instruments and one which tests 
the significant outcomes of the social studies rather than the mere tools 
and preliminaries to their acquisition. A number of these outcomes 
follow: 


1. Understanding of important institutions by means of which society 
functions, including their principles and ideals. These include: 

a. Local, state, national, and international political institutions
b. The same types of economic institutions
c. Social institutions, such as domestic, religious, and ethical
d. Educational institutions
e. Esthetic institutions
f. Recreational institutions


2. Skill in employing sources of information about society. These 
include: 
a. Oral and printed current gossip 
b. Oral and printed reasoned discussion 
c. Real and pictured social activities 
d. Present and past material achievement 


3. Acquiring points of view, attitudes, ideals, and interests. Among 
these are: 
a. Perspective in current affairs 
b. Historical mindedness 
c. Locational mindedness 
d. Concern for the common good 
e. Racial, religious, national, and social tolerance 
f. Leisure time interests 


4. Social orientation 


5. Work skills employed in this field. Some of these are:
a. Reading to locate information
b. Summarizing
c. Outlining
d. Interpreting cartoons, charts, graphs, tables, and so forth
e. Using books, their tables of contents, indexes, and other aids
f. Map reading

As a second example, we may take the field of the natural sciences.
Among the recognized objectives for which tests have been made, or for
which efforts along that line are under way, are these:


1. Knowledge of principles and facts 

2. Understanding of the technical terminology and symbols 

3. Ability to identify structures and processes and their functions 

4. Familiarity with reliable sources of information 

5. Ability to recognize unsolved problems 

6. Interest in natural phenomena and in solving problems involving 
them 

7. Ability to draw reasonable generalizations from experimental data 

8. Ability to plan experiments to test hypotheses 

9. Ability to apply scientific principles to new situations 

10. Skill in laboratory technics 


In the field of English, too, the early standard tests were chiefly
concerned with facts about writers and their works, grammatical rules,
common language usages, correct spelling, capitalization,
punctuation, and similar matters of fact. Of course all these should be 
tested, and are not being neglected by recent tests, but many other 
elements are being introduced. Among these are: 

1. Ability to read literary productions easily and comprehendingly 

2. Appreciation and critical judgment of what is read 

3. Comprehensive acquaintance with literature and literary history 

4. Vicarious widening of experience through reading 

5. Development of desirable attitudes, ideals, and interests through
reading

6. Competence in the use of libraries and their resources

7. Familiarity with conventional usages and tools of language 

8. Formation of correct habits of use of these in both oral and written 
expression 

9. Ability to organize the results of thought and experience into larger 
units of effective expression 

10. Reasonable desire to speak and write for the pleasure of both self 
and others 


Although these examples by no means exhaust the fields dealt with 
and although the three fields are only a small fraction of those included 
in our curricula, I believe that these are sufficient to afford you a clear 
idea of what I mean by the broadening expanse of outcomes which we 
are now attempting to measure by more or less standard tests as well as 
by informal ones. A prime factor in this expanding conception of the 
function of measurement has been the shifting of emphasis in the de- 
termination of test content from analysis of course content to analysis 
of the desired behavior outcomes, or changes, in pupils. 

Next in order it seems appropriate to consider the actual means by 
which outcomes of the sorts suggested are being measured, or at least 
by which attempts to measure them are being made. The examples em- 
ployed to illustrate some of these are largely drawn from the work of 
the staff of the Progressive Education Association study already referred 
to, but I do not wish you to get the impression that they are the only
group engaged in producing such materials. Furthermore, I should like 
to emphasize that these are merely examples of what standard and near-
standard tests may do, but that they should serve to stimulate wide use
of similar instruments made by teachers for their own use. 

One type of outcome which has received considerable attention is the 
ability to interpret data. In general this type of test consists of a 
paragraph or other selection presenting certain data, followed by a num- 
ber of possible interpretations to which the student is to react in some 
designated way. In some cases the number of possibilities is only four or 
five, as in the typical multiple-answer test, or even two, as in the 
alternative; in others a dozen or more are listed. A variation of this 
type is to have pupils give their own conclusions or inferences and then 
to rate these by means of a scale similar to an English composition 
scale in form and use. This method is, of course, less objective, but it 
may be made reasonably objective and embodies the advantage of more 
fully measuring pupil reactions. 

A test quoted by Raths as having been employed in a certain sec- 
ondary school is a good example of this type of test. It consisted of a 
table giving the expenditures of the national government for each of 
its chief functions for each of several periods. It was followed by a 
number of statements of conclusions or interpretations, of which the 
following are examples: 

“More money will be spent in the primary government functions in 
1939 than in 1936. 

“About a third of the government expenditures in 1936 went for 
relief. 

“The national debt was lower in 1936 than in 1920. 

“Interest paid on the debt in 1920 was over 40 times as great as in 
1910. 

“The primary costs of government increased every year from 1910 
to 1936.” 

Pupils were instructed to study each of the statements carefully 
and to mark it as being best characterized by one of these five de- 
scriptions: 

“The interpretation is so fully supported by facts that you could say 
it was true. 

“The statement is supported by the facts given to the extent that 
you could say it was probably true. 

“The statement is one for which the facts given are very insufficient, 
making it impossible for you to judge it one way or the other. 

“The facts which are given suggest that the statement is probably
false. 

“The facts which are given contradict the statement so that you can 
say it is false.” 
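The marking scheme just described lends itself to a simple tally. What follows is only a minimal sketch in Python, not the scoring procedure of the school or of the Progressive Education Association staff. The key, the pupil's marks, and the name `tally_interpretations` are hypothetical and serve only for illustration. The five descriptions are numbered 1 (true) through 5 (false), and the sketch counts how often a pupil's mark agrees exactly with the key and how often it misses by a single step.

```python
from collections import Counter

# The five descriptions, numbered in the order given to pupils.
DESCRIPTIONS = {
    1: "true",
    2: "probably true",
    3: "insufficient data",
    4: "probably false",
    5: "false",
}

def tally_interpretations(key, marks):
    """Compare a pupil's marks with the key, statement by statement.

    key and marks are equal-length lists of integers from 1 to 5.
    Returns a Counter with 'exact', 'near' (one step off), and 'far'.
    """
    if len(key) != len(marks):
        raise ValueError("key and marks must cover the same statements")
    tally = Counter(exact=0, near=0, far=0)
    for correct, given in zip(key, marks):
        if correct not in DESCRIPTIONS or given not in DESCRIPTIONS:
            raise ValueError("each mark must be a number from 1 to 5")
        distance = abs(correct - given)
        if distance == 0:
            tally["exact"] += 1
        elif distance == 1:
            tally["near"] += 1
        else:
            tally["far"] += 1
    return tally

# Hypothetical key and one pupil's marks for five statements.
key = [3, 2, 3, 1, 5]
marks = [3, 1, 4, 1, 2]
result = tally_interpretations(key, marks)
print(dict(result))
```

Separating near misses from wide ones is a design choice of this sketch; it reflects the thought that marking "probably true" where the data are insufficient is a different kind of error from calling a contradicted statement true.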

A further step that is sometimes taken is to include reasons why 
the conclusions have been chosen as correct and to ask that the pupils 
indicate the correct reason as well as select the correct conclusions. For 
the example just given, such reasons might be: 


a. The trend has been constant for so long that it will probably 
continue to be so.

b. The demands of our people for more governmental services are
increasing.

c. Much of the total amount spent for relief was not so labeled,
but was expended through other channels and therefore included under 
other headings. 

d. The rate of interest was not the same in 1920 as in 1910. 

e. It is very likely that changes in economic and social conditions will 
soon disturb the trends of the last few years. 

f. Since taxes are increasing, less money needs to be borrowed. 


The addition of this feature makes it possible not only to find out 
whether the pupil makes correct judgments of the validity of the inter- 
pretations and of the correct reasons therefor, but also to ascertain 
whether in his incorrect judgments he is thinking logically and basing 
these judgments upon reasons consistent with their meaning. 


Another type of thought process is the application of principles to 
new situations. An example of this may be taken from one of the tenta- 
tive chemistry tests worked out by the staff of the Progressive Education 
Association Study. One exercise in it consists of a paragraph as follows: 

“A water solution of hydrogen chloride is placed in a glass vessel 
containing two separated carbon electrodes which are connected with 
the opposite poles of a storage battery. What will happen at each 
electrode, and why?” 

This is followed by four statements: 


“Hydrogen gas will bubble off the negative electrode. 
“Hydrogen gas will bubble off the positive electrode. 
“Chlorine gas will bubble off the negative electrode. 
“Chlorine gas will bubble off the positive electrode.” 


As many of these are to be checked as are correct, after which the 
supporting reasons are to be chosen from a given list, as in the previous 
exercise. 


Another type of testing procedure that may be employed in science 
has been suggested by Buckingham and Lee. They first tell pupils 
to write papers upon a given subject. Next the testers present to the 
class a list of true-false statements to be marked according to whether 
they are true or false, then again as to whether they deal with points 
necessary to the papers. After this pupils are to supply the additional 
facts needed for their papers and then, finally, write the papers. By 
examination of the result, the teacher can analyze and diagnose pupils’ 
mental processes much more completely than if only the completed 
papers were available. 

Ability in outlining is one of the work skills that is generally ac- 
cepted as a worthy objective. Accordingly a number of attempts to 
measure it have been made. One approach is to give a selection, followed 
by a number of possible heads and subheads, titles, and perhaps even 
symbols. Pupils are then to fill in a blank form with the proper items 
selected from among those given. Another procedure is to give a 
jumbled list of heads and subheads without a blank form, to be properly 
arranged. Still another is to include in the jumbled list items which
will serve to make two, or even more, outlines, so that pupils must sort
them out as well as put them in proper coördination and subordination.

English is a subject in which there are important outcomes that 
the objective test methods of a few years ago had not touched. I do 
not mean to imply that such methods have now been found to measure 
all outcomes, but at least less subjective methods than those formerly 
employed are being devised for some of them. One of these is to 
present to pupils a paragraph composed of sentences of equal length to 
be rewritten with no changes except such as are necessary for improv- 
ing its readability. Instead of this, one composed of short, choppy 
sentences, or of long, involved ones, may be presented for improvement. 
Another similar type of task is to reduce a series of verbose sentences 
to conciseness without sacrificing their meaning. 

Not nearly all of the measurement of desired outcomes is being 
made by means of tests, in the narrow sense of the word. With the 
recognition of the importance of ideals, attitudes, and interests, scales 
and questionnaires have come into use as measuring instruments con- 
nected with classroom instruction as well as with other educational 
activities. Among the leaders in the construction of such instruments 
are Thurstone, at Chicago, and Remmers, at Purdue. As examples, I 
should like to mention several. 

Undoubtedly a significant outcome of social studies courses, as well 
as of the whole school, is the attitudes which pupils form toward social 
institutions. Miss Kelley, working under Remmers, has prepared a scale 
which may be employed to measure such attitudes. It consists of forty- 
five statements arranged in order from superlatively favorable to the 
opposite extreme. Pupils are to check those with which they agree and 
the median one checked by each is his score. Probably many of you are 
familiar with this instrument, so I shall not go into further detail 
concerning it. 
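For those not familiar with the instrument, the scoring rule can be sketched in a few lines of Python. This is a hedged illustration and not the published scoring key. It assumes the score is simply the position of the median statement checked, counting from 1 (most favorable) to 45 (least favorable); a published scale may instead assign each statement a calibrated value and take the median of those values. The function name `attitude_score` and the sample checks are hypothetical.

```python
from statistics import median

def attitude_score(checked_positions, n_statements=45):
    """Return the median position among the statements a pupil checked.

    Positions run from 1 (most favorable) to n_statements (least favorable).
    Returns None when the pupil checked nothing.
    """
    positions = sorted(set(checked_positions))
    if not positions:
        return None
    if positions[0] < 1 or positions[-1] > n_statements:
        raise ValueError("position outside the scale")
    return median(positions)

# Hypothetical pupil who checked five statements near the favorable end.
pupil_checks = [4, 7, 9, 12, 15]
score = attitude_score(pupil_checks)
print(score)
```

When a pupil checks an even number of statements, `statistics.median` returns the midpoint of the two middle positions, which is one reasonable convention among several.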

Another of the scales of this general type, although not just the 
same in form, is Wrightstone’s Scale of Civic Beliefs. Instead of a 
series of statements in order from one extreme position to another, it
consists of a number of statements, each to be marked as to whether the
subject agrees or disagrees with it. The total score is a measure of liberal-
ism or conservatism; those on the four parts measure racial attitudes,
international attitudes, national political attitudes, and attitudes toward 
national achievements and ideals. I give one example of each: 

“The American Indians should be kept on reservations.” 

“The United States should remain isolated from European nations.” 

“It is useless to vote in cities where elections are controlled by 
strong political bosses.” 

“In every respect the United States schools are better than those 
of any other nation.” 

Similar scales may be employed in the measurement of apprecia- 
tion. For example, another one of the Purdue series, by Hadley, may 
be cited. It consists of thirty-eight statements expressive of attitudes 
toward any poetic selection ranging from “This is undoubtedly one of 
the world’s masterpieces” through “I like this poem very much,” “This 
poem is graceful,” and “This poem is too emotional to appeal to me,” 
to “I wish I had never even heard of this poem.”

Another approach to the appreciation of literature, or, for that
matter, of any subject or activity, is through the use of questionnaires.
For example, the Progressive Education Study has made use of such
questions as these:

“If you were to write a critical appreciation of this novel, would it
be a more or less favorable one?”

“After you had started this novel, did you at any time leave off
your reading of it in order to engage in some other more attractive
activity?”

“Once you had finished this novel, were you genuinely glad to be
through with it?”

“Do you think it likely that you will make an effort to read more
of this author’s work within the very near future?”

All of the tests and other methods of measurement illustrated so
far are of the pencil and paper variety, with responses to be given by
pupils. There are, however, other types which are proving their worth.
A large group may be included under the general head of records. The
interpretations of many of these must of necessity be quite subjective,
yet they should not be overlooked because the type of evidence they
afford is so valuable and difficult to obtain by other means. One rather
thoroughgoing treatment has listed the following sixteen types of
records worth securing and studying:

1. Personal patterns of goals, including school work, home, friends,
sports, reading, hobbies, and so forth

2. Significant experiences: samplings of days or weeks

3. Reading: amount, kinds, degree of understanding, emotional reac-
tions, and so on

4. Cultural experiences: plays, movies, radio programs, museums,
travel, music, religious activities, scientific experiences, and others

5. Creative expression: writing, art, music, drama, laboratory work,
and so on

6. Anecdotal records of significant experiences, especially those con-
nected with the objectives of the school

7. Conferences: their outcomes, decisions reached, unsolved problems

8. Excuses and explanations offered by pupils

9. Test and examination scores

10. Health and family history, including annual examination data as to
sight, hearing, strength, maturation, and so forth, also illnesses,
habits, and other points

11. Oral English

12. Pupil affairs: clubs, athletics, social events, self-government, and
so forth

13. Personality ratings and descriptions

14. Questionnaires and inventories of family and personal backgrounds,
interests, preferences, opinions, and related items

15. Courses carried and activities engaged in

16. Administrative record: data as to entrance, progress, honors, dis-
ciplinary action necessary, and other official records

Of all these perhaps the anecdotal records are assuming the most
importance in the thinking of many workers. For example, this ap-
proach has been employed in dealing with the inculcation of good man-
ners. Pupils collected instances of the violation of good manners and
from them derived a relatively short list upon which to concentrate. 
Observation and anecdotal recording were then used to determine the 
status of the group with regard to the observance of good manners, a 
period of emphasis and attention to them was given, and further observa- 
tion and recording followed. The results were compared with the first 
observation and recording to determine the effectiveness of the period 
mentioned. 

A somewhat similar but broader undertaking had to do with the 
testing of the understanding and putting into practice of desirable social 
relationships. Anecdotal records of two main types of activities, individ-
ual self-initiated ones and coöperative ones, formed the basis of this
measurement. The former included such activities as voluntarily bring-
ing in clippings, exhibits, books, charts, and other contributions, sub-
mitting data from outside the school, from trips, observations, and so
forth, presenting reports on self-directed observation, and suggesting
methods and material activities for developing a project or problem.
Among the coöperative activities were those of helping other pupils or
the teacher with projects or problems; offering a book, a chair, a pencil, 
a tool, or some other helpful article to the teacher, another pupil, or 
a visitor; and responding quickly to requests for quiet, help, or some- 
thing else. 

With one more example I shall close my list of examples of the 
measurement of pupils’ responses. Artistic progress has usually been 
measured by comparing the final with the initial product. Now it is 
suggested that pupils should preserve and hand to the teacher all the 
intermediate sketches or other steps, so that by critical examination of 
them the teacher may much more thoroughly trace the mental processes 
of the learner and thus be in a position to afford him much more under- 
standing help. 

Finally among the types of measurement presented I should like to 
include one of schools rather than of pupils. For quite a number of 
years our regional associations and institutions of higher learning have 
devoted much thought and energy to developing means of rating 
schools. Within the last few years we have had what promises to be a 
very significant contribution to the attainment of this end. I refer to 
the work of Eells and his collaborators who are carrying on the Coöpera-
tive Study of Secondary School Standards. They have developed a rating
plan containing 100 scales, with shortened forms of 50 and 25, on each 
of which individual schools may be rated in comparison with schools 
in general and with those of each of a number of types. In each case 
the rating given a school is based upon data concerning the characteristic 
being scored, data which are as objective as possible but in many cases 
involve subjective judgments of those competent to give them. The 
scales are grouped in nine divisions: curriculum, pupil activities, library,
guidance, instruction, outcomes, staff, plant, and administration. I shall 
not take time to describe this further, but earnestly recommend that
all of you who are interested in the secondary school become familiar
with it if you have not already done so.

I believe that what I have just been presenting has the most practi-
cal value of any portion of this paper, if I may be egotistical enough
to assume that any portion of it has practical value, but in the few
minutes remaining I should like to mention a number of other recent 
tendencies in educational measurements. 

One of these is the fact that, through the efforts of many persons, 
schools, and organizations, test construction and distribution are becoming
more professionalized and less commercialized. I do not mean to sug-
gest that the business of the leading commercial publishers of tests
is decreasing, nor that they have not produced and are not yet produc-
ing many excellent tests. It is, however, true that under the leader-
ship of such organizations as the Coöperative Test Service, the Educa-
tional Records Bureau, and others, tests of superior quality are being
made available, on a cost basis, and constant effort is being put forth
to improve them. Connected with this tendency is the appearance of a
few series of tests, of which that of the Coöperative Test Service is
the outstanding example, similar in general form, in meaning of scores, 
and otherwise adapted to comprehensive programs. 

Another tendency that has become quite marked and appears to be 
still on the increase is the use of tests in institutions of higher learning. 
Since shortly after group intelligence tests were devised their use in 
such institutions has persisted, but the last few years have witnessed 
much attention to subject-matter examinations in colleges and uni- 
versities. This has been true especially in such schools as the Univer- 
sity of Minnesota and the University of Chicago, where there have been 
more or less thoroughgoing curriculum reorganization and other changes. 
Each of these institutions, for example, has set up a fairly elaborate 
procedure for the construction of the final comprehensive examinations 
which now constitute an important element in their educational plans, 
procedures which involve the coöperation of experts in the field of
measurement and in those of the subjects concerned. Even the informal 
tests given by instructors, those which are for diagnostic and instruc- 
tional purposes only and do not affect the final marks directly, are 
receiving careful consideration. 

There has been somewhat of a revival of interest in the improve- 
ment of the essay examination. I am glad to have been among those 
who, when some of our most prominent workers in this field seemed to 
condemn this type of instrument to practical elimination from our 
schools, refused to accept such a verdict and insisted that it should be
retained. Despite the new forms of short-answer exercises being de- 
vised, there are still, and I believe will continue to be, some of the 
desired outcomes of instruction for which the validity of well made 
and carefully scored essay or discussion examinations exceeds that of 
the other type. Recently there have appeared a number of suggestions 
as to how the administration of essay examinations may be so im-
proved as to render them more effective measuring instruments. It
has been shown that their reliability can be much increased over that
found in the well-known investigations of Starch and Elliott and
others. In saying this I do not wish to be understood as recommending
them in preference to those of the short-answer type when the latter
really measure the ability or trait desired. 

Much attention is now being paid to more efficient means of scor- 
ing tests. When short-answer tests, whether standard or informal, first 
appeared, the use of strips of cardboard with the correct answers thereon 
and so arranged as to be easily matched with pupils’ responses seemed 
a very efficient method. Later so-called “self-scoring” tests appeared, 
with carbon strips through which responses were carried to sheets where 
their positions indicated their correctness or incorrectness. Also came 
the scoring stencil with what we may call windows, through which 
correct answers are revealed and all others concealed, a method which 
permitted considerably more rapid scoring than the use of strips. 
Finally, test scoring machines of various types and degrees of effective- 
ness appeared. The outstanding one is undoubtedly that manufactured 
by the International Business Machines Corporation. This not only 
scores tests of many types and varieties, some as rapidly as fifteen 
per minute, but also performs almost unbelievable feats in computing 
weighted scores, averages, and so forth. A number of the leading test 
publishers are now preparing two editions of their tests, one for hand 
scoring and one for machine scoring. The expense of such a machine is, 
of course, too great for the small school or system, but every large 
system should have one. There are some smaller and less expensive 
machines which may be purchased for the same purpose, but they are 
also less efficient. There are already centers where good scoring ma- 
chines are located that offer scoring service to those who do not have 
it. The cost of scoring by machine, even when the papers have to be 
sent to a scoring center, is decidedly below that of employing clerical 
workers to do it by hand methods, and the accuracy is decidedly greater 
unless a great deal more is spent for checking. Another means of reduc- 
ing the scoring burden, and one that may be employed wherever a 
mimeograph machine or a printing press is available, is to have 
answers recorded by position (this is necessary for machine scoring 
also) and then run the answer sheets through the mimeograph or 
press and in some quickly and easily recognizable manner distinguish 
the correct answers. An even simpler way to accomplish the same end 
is to stack the papers, from fifty to one hundred in a pile, carefully 
and evenly, and then punch an awl through the positions where the cor- 
rect answers are. After any of these methods of marking papers, 
answers can be counted much more rapidly than if they have to be 
compared with those on a strip. 
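The principle common to the stencil, the punched stack, and the machine is that an answer recorded by position can be scored merely by checking whether a mark lies in the keyed position. A minimal sketch of that principle in Python follows. It is an illustration only and does not describe the working of any actual scoring machine; the key, the answer sheets, the item weights, and the name `score_by_position` are all hypothetical.

```python
def score_by_position(key, responses, weights=None):
    """Sum the weights of the items whose mark lies in the keyed position.

    key:       the correct position (for example 1 to 5) for each item
    responses: the position the pupil marked for each item, or None if omitted
    weights:   optional per-item weights; every item counts 1 by default
    """
    if len(key) != len(responses):
        raise ValueError("key and responses must have the same length")
    if weights is None:
        weights = [1] * len(key)
    if len(weights) != len(key):
        raise ValueError("weights must match the number of items")
    return sum(w for k, r, w in zip(key, responses, weights) if r == k)

# Hypothetical five-item key and two pupils' answer sheets.
key = [2, 4, 1, 3, 5]
sheets = {
    "pupil_a": [2, 4, 1, 1, 5],
    "pupil_b": [2, None, 3, 3, 5],
}

# Raw scores: one point for each mark in the keyed position.
raw = {name: score_by_position(key, marks) for name, marks in sheets.items()}

# A weighted score, with the third and fourth items counting double.
weighted = score_by_position(key, sheets["pupil_a"], weights=[1, 1, 2, 2, 1])

# The class average of the raw scores.
average = sum(raw.values()) / len(raw)
print(raw, weighted, average)
```

Because the comparison is by position alone, the same key serves for every paper, which is what makes counting so much faster than matching each response against a strip.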

New statistical procedures for interpreting and otherwise treating 
scores are being devised. Among the most prominent and promising of 
these is the factor method. It is, according to Thurstone, who is prob- 
ably its chief advocate, often being misused through the neglect of the 
five basic conditions to insure validity, but, in the hands of those who 
are competent to employ it, it is already yielding helpful new interpreta- 
tions. Another method is that of isochrons, chiefly urged by Courtis. 
It is concerned with the growth curves of individuals. The same worker
is also pointing out the need of measures of effort as bases of com-
parison and interpretation of the scores we are now able to obtain.
Indeed he has said that at present our efforts to put meaning into the
scores we secure, efforts which we call scientific deduction, would be 
better termed intelligent guesswork. 

We are at present moving both toward greater standardization and 
away from it. Such activities as those already mentioned along the line 
of unified or harmonized series of tests tend toward standardization. On 
the other hand, the use of tests has been, and I believe increasingly is, a 
force working to modify our curricular and instructional activities in 
the direction of less uniformity and greater attention to the individual 
needs and capacities. Tests are assisting us in our progress toward
dynamic rather than static procedures, away from our former beliefs
that through their use we could establish definite laws of learning and
so forth, and thus reduce teaching to a formula, and in the direction, men-
tioned above, of recognition that each child constitutes an individual
problem. They have served to reveal to us the many hitherto scarcely
suspected factors of which we must take account, and the variation 
in these factors from person to person. 

Near the beginning of this talk I mentioned the increasingly 
critical attitude now prevailing in this field. One evidence thereof is 
that at last we have a real beginning in the provision of reviews of tests 
that are at least as critical as those of books which appear in our educa- 
tional periodicals. There has been a little of this, sporadic and fre- 
quently not highly critical, in periodicals, and somewhat more in some 
of the volumes dealing with tests in various fields, but the appearance 
of the 1938 Mental Measurements Yearbook, prepared under the editor-
ship and chiefly through the efforts of O. K. Buros, marks a very decided
step forward in this matter. 

Finally, you may be interested in knowing something of what 
teachers and schools are actually doing with tests and related activities. 
Therefore, I shall present a few of the findings which Lee and Segel 
published two or three years ago. Unfortunately, these are limited 
to those of high school teachers, but in many points they do not, I 
believe, differ greatly from those of elementary teachers. Some 1,600 
teachers, representing schools of various sizes widely distributed through- 
out this country, supplied the data. 

The median number of tests per semester which high school teach- 
ers give is approximately twenty, of which over sixty per cent are not 
over fifteen minutes in length. Foreign language and mathematics 
teachers give tests most frequently, whereas those in physical educa- 
tion, music, art, home economics, and industrial education give them 
least frequently. Commercial teachers use the most standard tests, with 
English and mathematics teachers also high in this respect, whereas 
the same groups that made use of the smaller numbers of tests in 
general likewise do so in the case of standard tests. Sixty per cent 
of all teachers reported that they give no standard tests, 25 per cent 
that they give one or two per semester, and 15 per cent, three or more. 

Despite my acquaintance with the wide spread of the short-answer 
test, I was surprised that the study showed practically three-fourths of 
the teachers giving it the chief place in their practice, as compared with 
one-sixth for the essay type and one-tenth who gave them equal place. 
Fine arts teachers used the largest percentage of the short-answer 
type, those of mathematics and Latin the smallest. Of the varieties of 
this general type, the completion was reported as most often employed, 
followed by the true-false and the single-word answer. Most of the 
teachers included several varieties in their testing programs. In almost 
one-fourth of the instances no final examination appeared, this being 
most common in physical education and least so in Latin. The median 
length of the finals given was 90 minutes, but one-fourth of them were 
two hours or more long and almost another fourth less than one hour. 
Ten per cent of the finals were standard tests. 

The study reports data as to the uses of test results made by 
teachers, but they are quite extensive and I shall not give them other 
than to say that they constitute a wide range and that at least a fair 
fraction of high school teachers are alive to the many purposes tests may 
serve. More teachers would like to use standard tests, but probably do 
not do so because of the cost. A majority expressed the wish that 
authors of textbooks prepare chapter or other sectional tests to accom- 
pany the books. Almost three-fourths of the teachers would be glad to 
have intelligence tests administered to all their pupils, and still others 
to those who are problem cases. Likewise, two-thirds favored prognostic 
tests for all. 

To summarize briefly, some of the recent trends in the educational 
measurement movement are: 

1. A revival of interest therein, on a broader and more critical basis 
than formerly 

2. An enlarged group of objectives 

3. The construction of both standard and other short-answer tests to 
measure these larger objectives, especially those involving what are 
often referred to as the higher mental processes 

4. The increasing use of scales and questionnaires as well as tests 
to measure the outcomes of teaching 

5. The employment of anecdotal records for the same purpose 

6. The increasing use of measures in other lines of educational activity 

7. The professionalization of test construction and distribution 

8. The improvement of essay examinations 

9. More economical and accurate scoring methods and more significant 
statistical procedures for their interpretation 

10. Increasing attention to testing procedures on the part of teachers 
Measurement of Speech and Hearing Defects 


ROBERT L. MILISEN, Assistant Professor of Speech, 
Indiana University 


WHEN a child comes to our clinic, we not only attempt to examine 
him for his speech defect, but also for other special abilities and dis- 
abilities. It is our objective to know the whole child. Teachers are 
interested in the same problem, and yet they frequently do not under- 
stand the defective child as a total personality, because of a lack of an 
accurate and complete testing program. Up to the present date, most 
diagnostic programs have not considered speech or hearing deficiencies 
of sufficient importance to warrant testing for them, and the tests, 
when given, have been entirely unsatisfactory. 


One reason for this condition is the fact that people working in 
the field of speech use confusing classifications. Such terms as 
dysarthria, dysphonia, and dyslalia have very little reason for being used, 
since many simpler terms which are more descriptive could be sub- 
stituted. We must have a simpler method of classification so that it can 
be understood not only by the individual with special training, but also 
by every teacher with no special training. 

As I have moved around among the school systems of different 
states I have realized that teachers are poorly informed concerning the 
speech and hearing problems presented by their pupils. They are 
unable to understand the speech defective child and adapt their teach- 
ing to his needs because they are unable to tell the difference between 
types and because they do not understand the basic causes for these 
disorders. However, the fault lies not with the teacher, but with the 
method of classification, which is not simple or meaningful in itself. 
The medical profession has had a great number of terms which have been 
used to impress the patient. However, the trend in medicine is away 
from such practices, with the stress being placed on simple descriptive 
language which enables the layman to understand his disorder. Under- 
standing or insight is one of the most valuable mental hygiene tools. 

A simplified classification of speech and hearing disorders may be 
as follows: 


I. Stutterers, sometimes called stammerers, present disturbances in 
rhythm. Approximately 1 to 1½ per cent of all school children 
present this disorder. 


II. Articulatory defectives are children who cannot form individual 
sounds correctly. Approximately 2 to 10 per cent of all school 
children present this disorder. 


III. Organic disorders such as cleft palate, paralysis, etc., result in 
speech defects in approximately ¼ to ½ per cent of all school 
children. 
IV. Hard-of-hearing children whose disorder is of sufficient magnitude 
to cause scholastic deficiencies comprise ¾ to 1 per cent of 
the school population. An additional 4 to 5 per cent of the children 
present a slight hearing loss which is important from the stand- 
point of medical treatment in order to prevent further deteriora- 
tion. 


Stuttering is a disorder of rhythm over which the child has very 
little voluntary control. Yet we, as adults, frequently insist that the 
child is able to control his stuttering when he wants to. We imply, either 
directly or indirectly, that his stuttering is a habit. If we avoid calling 
it a habit, we will avoid many disciplinary problems. A habit can be 
stopped as a rule, but the harder a child tries to stop stuttering, the 
more difficulty he has. Criticism or social disapproval of stuttering 
will frequently increase stuttering and bring with it serious discipline 
problems. However, the severity of overt stuttering varies greatly from 
one child to another and from one speaking situation to another. It 
must be remembered that a child does not have to be a severe overt 
stutterer to have a bad mental attitude because of the disorder. It must 
further be remembered that all these wild gesticulations, bodily move- 
ments, and grimaces are not actually stuttering spasms, but rather are 
the resultant mechanism brought about by repeated attempts to stop 
or avoid stuttering. 

This disorder generally begins at an early age, the average age 
of onset being three. The nature and development of the stuttering 
is usually as follows: 

At the beginning the average child is unaware of his stuttering 
and his spasms are usually in the form of easy, effortless repetitions of 
the initial sounds of words or the overlong maintenance of the first 
sound. This may be thought of as the primary stage of stuttering. 
If the young stutterer does not outgrow his disorder while in this 
stage, his speech defect will generally become worse. In some way or 
another, either intentionally or accidentally, criticism or social disap- 
proval will be attached to his stuttering. This will lead to the stutterer’s 
attempts to release or stop his individual spasms, which attempts gen- 
erally lead to more tension, more grimaces, and a prolongation of 
the length of the interruptions. As a result of this inability to release 
the spasm, the average stutterer begins to look ahead in an effort to 
anticipate his spasm words in order that he may put off the attempt 
to say them until he is sure he can say them without difficulty, or he 
may leave them out entirely. This act of anticipation and avoidance will 
sometimes help the stutterer in the immediate speech situation, but it 
fails a sufficient number of times to make him feel insecure regarding 
his speech. This feeling of insecurity results in a social morbidity, 
since he can avoid stuttering now only by refusing to talk in difficult 
situations, hence his refusal to recite in class. 

Articulatory disorders are found in a large number of individuals. 
The causes of this disorder are more easily understood and treated than 
the others. Defective articulation means misuse of a single sound and 
should not be confused with pronunciation, which refers to integration 
of sounds into a word. Most sounds are used in three positions in words: 
in the initial, the medial, and the final position, as in rat, horse, and 
bear. There are only three ways that a sound may be misused. (1) The 
sound may be articulated indistinctly. In this case the sound does not 
resemble any of the sounds in our language. (2) A sound may be sub- 
stituted for another sound. For example, a common substitution is “wat” 
for “rat,” in which case a well-formed w is substituted for the r in 
“rat.” If the “r” sound is substituted for in all three positions whenever 
the individual tries to say “r,” his speech will be badly distorted. 
(3) A sound may be omitted from a word. For instance the word “rat” 
might be pronounced “at,” thus leaving out the “r” sound. 


The frequency and severity of this disorder in articulation vary 
inversely with the age of the child, since approximately 10 to 15 per 
cent of kindergarten children have difficulty articulating their sounds 
correctly, and only about 2 to 3 per cent of high school seniors present 
the same disorder. Thus we can see that growth and maturation assist 
in correction, but too many children still present speech problems of this 
nature even after reaching maturity. For instance, a freshman girl 
twenty-two years of age came into the Clinic last fall. Her articula- 
tion was so indistinct that it was almost impossible to understand her, 
hence she would usually substitute writing for speech. She was very 
intelligent and, fortunately, she was a hard worker and ambitious to 
learn to speak. By the end of the semester she could form all her 
sounds correctly, and, when she talked slowly, she could integrate them 
into words which were pronounced distinctly. Another year of training 
will give her normal speech, but unfortunately her personality has been 
badly affected as a result of this speech impediment. The time for cor- 
rection is during childhood, not after adulthood has been reached. 


The understanding of organic disorders usually requires a physical 
examination designed to determine the amount of physical impairment 
which exists and to predict the amount of physical correction which may 
be brought about before the speech training begins. The results of surgi- 
cal repair in cleft palate, cleft lip, and dental and nasal abnormalities 
are very good. Sometimes surgical aid will help the paralyzed child; 
however, all children presenting a definite physical impairment should 
be under the observation of a physician during periods of treatment. 

Usually the children are of normal intelligence, but present abnormal 
personalities which match their abnormal bodies. Hence any good treat- 
ment will first consider the personality adjustments and later the speech 
retraining. The proper education of these children requires at least a 
special room with a well-trained clinician serving both as a teacher and 
a correctionist. 

The speech disorders presented by these children are directly 
resultant from the physical abnormality, either because the necessary 
speech organs are defective, or because the child thinks they are de- 
fective. This last aspect of the speech defects of these children offers 
us our first point of attack of the problem of correction. It is caused 
by a general misunderstanding of the use of various organs and gen- 
eral discouragement on the part of the child regarding speech, or, in 
some cases, plain laziness. 
The ability to hear is a matter of degree varying from normal hearing 
to total deafness. A hearing loss may be the same in both ears, or 
greater in one than in the other. The hard-of-hearing child with an 
average loss of 15 per cent or more in both ears will probably have 
difficulty in school, yet the average teacher will not recognize the child 
as being hard of hearing, and the child will not know it himself: herein 
lies the chief educational problem. These children present social and 
scholastic difficulties which arise from inability to hear, but they are 
accused of being dull and are frequently referred to as behavior prob- 
lems. They are generally of normal intelligence and their personality 
disorders have resulted from the misinterpretation of a physical impair- 
ment by the other children and the adults who influence them. Give 
them a chance to understand their disorders and to achieve at a speed 
equal to their mental ability, and these behavior problems will dis- 
appear. 

Many more children will be found who have slight losses in hear- 
ing of less than 15 per cent. These children are potential school prob- 
lems, since their hearing may become further impaired. The chief pur- 
pose of a program for them is to give proper physical attention in order 
to prevent further deterioration. 

Such a short paper cannot present too complete a testing program. 
However, a few hints may be worth while. 

Stutterers should be studied for handedness development, since shift- 
ing a left-handed child to the use of the right hand may help to precipi- 
tate stuttering. The type of spasm pattern should be studied to deter- 
mine the level of development the stuttering has reached. This will 
not only be an aid in understanding the speech defect itself, but also 
in understanding the personal attitude that the stutterer has toward 
his disorder. Poor mental hygiene in the stutterer is synonymous with 
fear of and unwillingness to stutter, and thus avoidance mechanisms 
are built up. A careful case history is helpful in many ways, one of the 
most important being the matter of the causal factors leading to the 
stutterer’s poor mental attitude. 

Articulation cases should be examined primarily for the following 
underscored sounds: see, zero, rat, lad, thesis, those, chick, jump, fish, 
very, kat, goat, shoe, measure, which, and want. Each sound, except 
those which are underlined twice, should be tested in its initial, medial, 
and final positions in the word. The errors in articulation should be 
classified in one of three ways: (1) indistinct, (2) substitution, and (3) 
omission. The material should be presented to the child either in the 
form of printed material or pictures. 
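
The classification just described lends itself to a simple record sheet. The following is a minimal sketch in Python and is not part of the original testing program; the names ArticulationError and summarize, and the sample entries, are hypothetical and serve only to show how each error might be recorded by sound, position, and type.

```python
from collections import Counter
from dataclasses import dataclass

# The three word positions and three error classes named in the text.
POSITIONS = ("initial", "medial", "final")
ERROR_TYPES = ("indistinct", "substitution", "omission")


@dataclass(frozen=True)
class ArticulationError:
    """One observed error (hypothetical record layout)."""
    sound: str       # the sound under test, e.g. "r"
    position: str    # one of POSITIONS
    error_type: str  # one of ERROR_TYPES

    def __post_init__(self):
        # Reject entries that fall outside the classification.
        if self.position not in POSITIONS:
            raise ValueError(f"unknown position: {self.position}")
        if self.error_type not in ERROR_TYPES:
            raise ValueError(f"unknown error type: {self.error_type}")


def summarize(errors):
    """Return (count by error type, count by sound) for one child's record."""
    by_type = Counter(e.error_type for e in errors)
    by_sound = Counter(e.sound for e in errors)
    return by_type, by_sound


# Invented sample record, for illustration only.
sample = [
    ArticulationError("r", "initial", "substitution"),
    ArticulationError("r", "medial", "substitution"),
    ArticulationError("s", "final", "omission"),
]
by_type, by_sound = summarize(sample)
```

A tally of this kind shows at a glance whether a child's difficulty is confined to one sound or spread over several.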

Following the test one can determine the amount of improvement 
which can be expected from corrective work by saying the sound and 
having the child imitate it. This should also be done when the sound 
is presented as a part of a word. 

Children with organic disorders should have, in addition to the 
physical examination, a test similar to the one given to articulatory 
defectives. 

Hard-of-hearing children should receive: (1) a physical examina- 
tion with special emphasis on the ear, nose, and throat; (2) pitch 
audiometer tests, which will determine the amount of hearing loss and the 
type of loss presented; and (3) a test similar to that administered to 
articulation defectives. 

In addition to this testing program each child should receive com- 
plete psychological and educational tests, personality examination, and 
tests for special abilities and disabilities, and if the child is old enough, 
he should be given vocational guidance tests. Accurate case history 
reports should also be obtained. 

For further reference to examination methods I would recommend 
Speech Pathology by Travis, Principles and Methods of Speech Correc- 
tion by Van Riper, and The Rehabilitation of Speech by West, Kennedy, 
and Carr. 
Reliability of the Categorized and 
Uncategorized Mental Test¹ 


GROVER T. SOMERS, Professor of Education, 
Indiana University 


The Problem 


Tests of mental ability at the present time number in the hundreds. 
They have appeared from every section of the country, under many 
different labels and designed by many hands. Occasionally they present 
the appearance of hasty construction, reflecting an undue eagerness on 
the part of their makers to rush them into print in the hope that they 
will gain recognition and a niche in the market place. And many and 
large are the claims for the merit and value of the different types now 
available. Indeed, today our problem is no longer centrally concerned 
with the construction of more tests, but rather with the scrutiny of 
those already constructed, particularly with respect to their validity 
and reliability. We are primarily interested in instruments which do 
well and consistently that which they are designed to do. 

It has long been the experience of the worker with these tests that 
individual tests are superior to group tests, both theoretically and 
practically, as measures of the subject’s mental ability. So, when the 
problem is to ascertain the ability of some particular individual case— 
or a reasonably small number of cases—the individual test is unhesitat- 
ingly chosen for that purpose. It has likewise been found that, for a 
subject who has not had the opportunity to develop facility in language 
or who has been trained in a foreign language, the non-verbal and 
performance tests are superior to the verbal and language tests. They 
yield a truer measure of ability and they repeat more nearly from time 
to time the results obtained. But there are many patterns and character- 
istics of tests, the real merits of which are as yet undetermined and 
unknown. 

The relative merits of the several forms into which the test items 
are cast—the true-false, same-opposites, yes-no, multiple choice (single 
or plural response), best answer, completion, analogies, problems, 
scrambled sentences, classification, matching, etc.—are not known defi- 
nitely. 

Reasoning and logic would lead us to favor some of these types 
of items over other types. Unfortunately, however, it has been found 
that logic is not always a safe guide in the field of mental testing. 
Occasionally, if not frequently, experimental results are found which do 
not harmonize with “logical conclusions.” And science favors the fruits 
of carefully controlled experiments irrespective of the source issuing the 
logical pronouncements. The very nature of science emphasizes objec- 
tivity, experimental determination, demonstration, and verification by 
the same or by different workers carrying on under similar conditions. 


¹ Dr. W. G. Piersol, Superintendent of the Central Tabulation and Analysis Office of 
the Illinois Tax Commission, assisted in conducting this experiment. 
Nor is it known in what form or order these different “castes” of 
items should be assembled and organized. The experimenters are un- 
certain—common practice to the contrary notwithstanding—whether it 
is better to continue to follow the pattern of Army Alpha, and group 
the items of like form into subtests, or to accept the Thurstone Psy- 
chological Examination IV (1919 Edition) as a model. Notable members 
of the family of Army Alpha are: National Intelligence Test, Terman 
Group Test of Mental Ability, and the American Council Test (1938 
Series). Some family members of the Psychological Examination IV are 
the Henmon-Nelson Test of Mental Ability (Grades III to VII, VIII to 
XII, and College), the Otis S-A Tests of Mental Ability (Intermediate 
and Higher), and the Odell Test of Mental Capacity. The vast majority 
of the workers in the field of test construction have chosen to follow the 
example set by the Army psychologists and mould a battery of tests 
with specified directions and examples for each type and with certain 
time allowance for each of the subtests. 

The argument most frequently advanced for the Army Alpha form 
of organization is that the testee’s reactions can be controlled with 
greater certainty and his abilities can be determined better thereby. 
Specific examples of what is required, with directions presented imme- 
diately for doing it, will insure the possible outcomes more nearly than 
directions and examples less close and less closely related to the im- 
mediate situation. Observations in the practical test room situation with 
younger subjects would seem to support rather strongly this argument.² 
And logically this point seems fairly sound. 

On the other hand, those in favor of arranging and presenting test 
items in no particular system or order in the sense of groups of like 
items remind us that intelligence testing really demands that the in- 
dividual be bombarded with challenges from various angles and in no 
set or systematic order. We are told that to be able to meet a complex 
situation not only novel in character but also varying in form is one of 
the real tests of mental ability. That the ungrouped or uncategorized 
types provide the subject with a fuller measure of bombardment there 
is little doubt. The necessity, or even desirability, of this type of ex- 
perience in taking a test is subject to examination and analysis. 

The claim has been made, with seemingly little or no experimental 
or statistical evidence to substantiate it, that the uncategorized form 
is more reliable. The purpose of this paper is to report the results of a 
very limited investigation of this claim. The experimental and statistical 
procedures employed for determining the soundness or correctness of 
this claim will be detailed very briefly. 


The Procedure 


The Odell Test of Mental Capacity was mimeographed in two forms: 
the original (uncategorized) form and a categorized form with one 
subtest on each sheet. Directions for these subtests were altered from 
their original form only in such a way as to indicate that each page was 
to be timed separately. Mimeographing of both forms was done to insure 


² The Henmon-Nelson Test of Mental Ability was given to third grade children in the 
University School of Indiana University. 
comparability inasmuch as printing a categorized form was out of the 
question for our purposes. The item of expense was considerable even 
in this form. The original plan was to use two tests, the Terman Test 
of Mental Ability as well as the Odell Test of Mental Capacity, but funds 
were not available for the project and the magnitude of the study was 
reduced to fit the funds. 

The timing for the subtests was experimentally determined. A group 
of eight university students took each subtest individually, the time re- 
quired in each case being noted. The mean time required for completion 
of each subtest and the test as a whole was proportioned to the total 
time allowed for the test—this being thirty-five minutes. This timing 
was used with a larger class of university students. Six of the subtests 
contained twenty items each and the other two, fifteen each. An average 
of the mean number of items answered on each subtest was obtained 
and proportional corrections were made for the two shorter ones. On 
the basis of this averaging, the timing was corrected for the experi- 
mental work. A slight error entered, perhaps, into the averaging in 
the case of those who finished a subtest before time was up. It is be- 
lieved that error was negligible in most cases. This was borne out by 
the fact that, when we tabulated four sets of test papers subsequently 
for the number of items answered, in no case did the mean of any sub- 
test differ from the corrected mean of the means by as much as one 
point. As finally determined and used, the timing in minutes and frac- 
tions thereof was as follows: 


The Subjects 


The test subjects were drawn from both high school and college stu- 
dents by virtue of convenience and availability. Five college classes were 
used, ranging from freshman to junior standing (with an occasional 
senior), the majority being sophomores. These were taught by two in- 
structors, designated in this study as Instructor A and Instructor B. 
The high school groups represented several schools in the southern part 
of the state, sponsored by seven teachers from small town to large city 
high schools. Again convenience and availability determined very largely 
the students used. These teachers were taking a course in the Theory 
and Application of Mental Measurements at the time and were interested 
in availing themselves of the opportunity to work with tests and get 
ratings on their pupils. A total of 354 subjects were used—172 in col- 
lege and 182 in high school. 
Half of the groups were given the uncategorized form first and the 
other half were given the categorized first, the purpose being to balance 
the practice effect. The retesting for the high school groups was done 
with the other form of the test two weeks after the first form was 
given. It was believed this interval of time was sufficiently long to 
render negligible the chance that the student would remember the items and 
not sufficiently long to permit significant learning changes to occur. 
For the college groups the elapsed time interval between testings ranged 
from one week to three weeks, depending very largely upon the avail- 
ability of the groups used. (Enforced absence of the instructor from his 
campus classes because of off-campus calls was the controlling factor 
here.) The high school groups were tested by their regular teachers who 
were pursuing a course in mental measurement at the time and the col- 
lege groups were tested by the research student associated in the study. 
In cases in which an interval of one week was used between the two test- 
ings, it proved not to be long enough for the students to forget the 
items. This was true in the case of two of these college groups, as in- 
dicated by their expressed recognition of the second test as being 
different from the first merely in form. 

The scoring for the most part was done under supervision by a 
group of graduate students in the course in mental measurements. The 
tabulation of odd-even halves was done entirely by the research 
associate and an assistant trained in this kind of work. Rechecks on papers 
taken at random from the whole lot gave a scoring error of approximately 
one point in every four or five papers. Care was exercised on every 
hand to insure accuracy in both the scoring and the tabulation and it is 
believed that the work was free from inaccuracies which would affect 
or influence materially the results or findings. 


Results 


In reporting the results of the study we shall consider them from 
three points of view: (1) the means in score points of both categorized 
and uncategorized forms from each group—high school and college; 
(2) the correlations of categorized and uncategorized test scores; and 
(3) the correlations of the two halves—odds and evens. By this we 
should be able to determine in part the answer to the questions of the 
relative reliability of the two forms and their proneness to higher scores. 
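
The odd-even comparison named in point (3) can be sketched as follows. This is only an illustration under stated assumptions: the item scores are invented, the function names are hypothetical, and the Spearman-Brown step-up at the end is a customary addition which the procedure described here does not explicitly mention.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, n >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def odd_even_scores(item_scores):
    """Split one paper (1 = right, 0 = wrong, in test order) into two halves."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd, even


def split_half_reliability(papers):
    """Return (half-test r, Spearman-Brown estimate for the full test)."""
    halves = [odd_even_scores(p) for p in papers]
    odd = [h[0] for h in halves]
    even = [h[1] for h in halves]
    r_half = pearson_r(odd, even)
    # Assumption: the usual step-up formula, not stated in the text.
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full


# Invented papers of six items each, for illustration only.
papers = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
r_half, r_full = split_half_reliability(papers)
```

Computing the coefficient separately for each form is what would allow the reliability of the categorized and uncategorized arrangements to be compared.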

For the high school students (as shown in Table I) the scores on 
the second test tended to be somewhat higher. The differences of class 
means on the two forms ranged from .45 of a score point (tenth grade 
pupils in Lincoln High School) to 18.09 score points (Wadesville High 
School). Only one difference was less than nine points among the high 
school groups and only one was larger than the sigma of the first test 
scores (Table III). The practice effect is obvious, though the amount of 
it is extremely variable. 

For the college groups the differences in class means ranged from 
5.40 to 11.27 score points, with a general average difference of approxi- 
mately eight points (Table III). Here again the practice effect is ap- 
preciable, but it seemingly made little difference which form came 
first. 

In the three classes under college Instructor B, the juniors were 
used as the time trial group on the basis of whose papers the time 
revision was computed. However, the score differences do not seem large, 
and the time changes were small. 


TABLE I. DIFFERENCES IN MEANS OF FIRST AND SECOND TESTS 


Categorized first                          Uncategorized first 

Lincoln: Grade 7............ 10.75 (20)*   Reitz (Evansville)..........  9.81 (28) 
Wadesville.................. 18.09 (48)    Lincoln: Grade 10...........  0.45 (22) 
Central (Evansville)........ 15.00 (32)    Bosse (Evansville).......... 11.83 (32) 
College: Instructor B                      College: Instructor B 
  —Junior...................  5.40 (40)      —Sophomore................ 
College: Instructor B                      College: Instructor A 
  —Freshman.................  8.25 (25)      —Sophomore................ 11.27 (26) 
College: Instructor A 
  —Sophomore................  9.15 (39) 


*Numbers in parentheses indicate the number of cases in each group. 


It is evident that the differences tended to be larger for the high 
school groups when the categorized form was given first. One might 
hazard a guess that items later encountered in slightly different settings 
would be more easily and readily recognized if they were first met in a 
homogeneously classified group form. This, however, is but speculation, 
and guessing has but little place at most in the field of objective and 
quantitative fact. 

The percentile ratings on the college aptitude test together with 
the test data for two of the university groups are presented in Table II. 


TABLE II. MEANS OF TEST SCORES AND PERCENTILES ON 
COLLEGE APTITUDE TEST 


                              Categorized   Uncategorized   Percentile on college 
                                 test           test           aptitude test 

College A: Group 1 (39)....     114.19*        102.92             60.56 
College A: Group 2 (26)....     105.89         115.04*            64.44 


The difference between second test means (marked *), regardless 
of form (whether categorized or uncategorized), is 0.85; between first 
test means 2.97. Sigmas on the first test are 11.4 for Group 1 and 11.0 
for Group 2. The difference between the means is not statistically sig- 
nificant. It is interesting to note that Group 2 had a slightly higher 
mean on both tests (categorized and uncategorized regardless of the 
order in which they came) and percentile ratings. The differences of the 
means of these percentile ratings are not significant, being approximately 
half of the sigmas of the two classes. 
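
A rough check of such a difference can be made with the critical ratio, that is, the difference between two means divided by its standard error. The sketch below assumes two independent groups and the usual large-sample formula; it is not necessarily the computation used in the study. The figures are the first-test means and sigmas quoted above, with the group sizes from Table II, and the function name is hypothetical.

```python
import math


def critical_ratio(mean1, sd1, n1, mean2, sd2, n2):
    """Return (ratio, standard error) for the difference between two means.

    Assumes independent groups; the standard error of the difference is
    sqrt(sd1^2 / n1 + sd2^2 / n2).
    """
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se, se


# First-test figures as quoted in the text:
# Group 2: mean 105.89, sigma 11.0, 26 cases
# Group 1: mean 102.92, sigma 11.4, 39 cases
ratio, se = critical_ratio(105.89, 11.0, 26, 102.92, 11.4, 39)
```

With these figures the ratio comes to about 1.05, well short of the values conventionally required for significance, which agrees with the statement above.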

The rank-difference correlations between first test scores and mental 
percentiles were found for these two groups. Rho was 0.83 for Group 1 
and 0.86 for Group 2. Each of these is fairly high, and might indicate 
considerable possible relationship except for the non-valid character 
of the correlations. We are here attempting to relate two types of 
measures—magnitude measures and position or order measures—which 
cannot be correctly or properly interpreted by the correlation technique.
It is somewhat like correlating the length of dog’s leaps with rabbit
ears and expecting to get a valid and meaningful result. 
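
For readers who wish to see the rank-difference computation itself, the sketch below applies the usual formula, rho = 1 - 6(sum of d^2) / (n(n^2 - 1)), to five invented pairs of scores. The function names and the data are illustrative assumptions and are not taken from this study; the formula as written assumes that no scores are tied.

```python
def ranks(values):
    """Return 1-based ranks (1 = smallest value); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_difference_rho(x, y):
    """Rank-difference correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Hypothetical data: five pupils' test scores and aptitude percentiles.
test_scores = [10, 20, 30, 40, 50]
percentiles = [15, 35, 25, 60, 80]

rho = rank_difference_rho(test_scores, percentiles)
print(round(rho, 3))
```

Only two adjacent pupils exchange places in the invented data, so rho comes out high. The computation uses nothing but the order of the scores within each list.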

The means, sigmas, and product-moment correlations of the scores 
on both forms for the different groups—high school and college—are 
given in Table III. Comparisons of the means and sigmas will show that 
Wadesville is the only instance in which the difference between means is 
larger than the sigma of the first form. The difference in means in this 
case is 18.12 and the sigma is 17.17. 

The coefficients of correlation between the two forms vary from 
0.645 to 0.909. From these data there seems to be no general tendency 
for the less advanced groups to show a higher correlation or a closer 
relationship, for both the highest and second lowest correlations were 
found for the college groups, and the lowest and second highest were 
found for the high school groups. The number of cases in this connec- 
tion is probably too small for valid conclusions. In one group there were 
only twenty pupils and in another twenty-two. 

The differences in means as shown in Table III are in no instance 
really significant. In only one group (Wadesville) was the difference in 
means a relatively large one. 


TABLE III. MEANS, SIGMAS, AND CORRELATIONS OF THE TWO 
FORMS—CATEGORIZED AND UNCATEGORIZED 


School                                 Categorized   Uncategorized   Correlation
                                          test           test

CATEGORIZED FORM FIRST
Lincoln: Grade 7
Wadesville
    Mean.......                          70.21          88.33         .842 (48)
Central (Evansville)
    Mean.......                          78.75          93.75         .754 (32)
    S.D........                          16.58          13.97
College: Instructor A—Sophomores
College: Instructor B—Juniors
College: Instructor B—Freshmen

UNCATEGORIZED FORM FIRST
Reitz (Evansville)
    Mean.......                          88.93          79.12         .690 (28)
Lincoln: Grade 10
    Mean.......                          57.95          57.50         .645 (22)
Bosse (Evansville)
    Mean.......                          76.44          64.61         .899 (32)
    S.D........                          18.61          18.18
College: Instructor A—Sophomores
    Mean.......                         115.04         105.89         .823 (32)
    S.D........                          15.50          11.00
College: Instructor B—Sophomores
    Mean.......                         110.88         104.52         .909 (37)
    S.D........                          15.90          15.57


In order to determine any measurable and significant differences in
the reliability of the two forms, the chance halves of the scores for each
individual on each form were compared and correlated. We could see
no reason for using the Spearman-Brown formula to determine the
reliability coefficient of the full-length test, and that was not done. It
was omitted because the purpose here is to determine what this test really
does as given in the two forms (categorized and uncategorized) under
the conditions which we have endeavored to describe. The correlations
for the chance halves are presented in Table IV.
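
A minimal sketch of a chance-halves correlation follows. It assumes that the halves are formed from the odd-numbered and even-numbered items, which is one common way of dividing a test and may not be the division used in this study. The item marks and the function names are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


def chance_halves_r(item_marks):
    """Correlate each pupil's score on odd items with his score on even items.

    item_marks holds one list of 0/1 marks per pupil (assumed layout).
    """
    odd_totals = [sum(marks[0::2]) for marks in item_marks]   # items 1, 3, 5, ...
    even_totals = [sum(marks[1::2]) for marks in item_marks]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Hypothetical marks for four pupils on a six-item test.
marks = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
]

half_r = chance_halves_r(marks)
print(round(half_r, 3))
```

The coefficient so obtained describes a test of half length. It is reported here without being stepped up, in keeping with the procedure described above.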


TABLE IV. CHANCE HALVES CORRELATIONS OF TWO TEST FORMS 


School                                Categorized        Uncategorized

CATEGORIZED FORM FIRST
Lincoln: Grade 7....................      ±.044 (26)     .855±.039 (22)
Wadesville..........................      ±.022 (51)     .722±.046 (48)
Central (Evansville)................  .753      (35)     .742±.053 (32)
College: Instructor B—Juniors.......  .882±.027 (32)     .882±.030 (25)
College: Instructor B—Freshmen......  .864±.027 (43)     .844±.031 (40)

UNCATEGORIZED FORM FIRST
Reitz (Evansville)..................      ±.082 (31)     .826±.040 (29)
Lincoln: Grade 10...................  .572      (27)     .585±.089 (24)
Bosse (Evansville)..................  .796±.044 (32)     .632±.064 (38)
College: Instructor B—Sophomores....  .893±.020 (42)     .867±.029 (37)


Among the three high school groups having the categorized form 
first, one shows a difference of seven times its PE in favor of the 
categorized form (Wadesville). And among the three school groups 
having the uncategorized form first, one shows a difference of six and 
one half times its PE in favor of the uncategorized form and another a 
difference of three and three fourths times its PE in favor of the 
categorized form. The other differences are not large or significant. 
The differences among the college groups are in all cases small and 
relatively insignificant. 
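
The probable errors attached to these correlations can be approximated by the customary formula PE = 0.6745(1 - r^2) / sqrt(n). That this formula was the one used here is an assumption, although it reproduces the entry for the Bosse categorized form in Table IV (r = .796, n = 32). The sketch also expresses the Bosse difference in units of that tabled probable error, which gives a figure near the three and three fourths mentioned above. The function name is hypothetical.

```python
from math import sqrt


def probable_error_of_r(r, n):
    """Customary probable error of a product-moment correlation r from n cases."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


# Bosse (Evansville), categorized form, as entered in Table IV.
pe_categorized = probable_error_of_r(0.796, 32)

# Difference between the two Bosse chance-halves correlations,
# expressed in units of the tabled probable error (.044).
ratio_to_pe = (0.796 - 0.632) / 0.044

print(round(pe_categorized, 3))
print(round(ratio_to_pe, 2))
```

Other entries in Table IV may differ from this formula in the third decimal place, so the agreement should be taken as suggestive rather than as proof of the method used.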

Table V gives the values received for the various classes considered
in this study, when our test for significance of differences (if the
difference divided by PE_diff is greater than four, the difference is
significant) is applied.
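
The criterion can be illustrated with the Bosse entries of Table IV. The sketch assumes that the probable error of the difference is the square root of the sum of the squared probable errors of the two correlations, the customary formula for independent measures; the study does not state the formula it used. The function name is hypothetical.

```python
from math import sqrt


def difference_over_pe(r_first, pe_first, r_second, pe_second):
    """Difference between two correlations divided by the PE of the difference.

    Assumes PE_diff = sqrt(PE1^2 + PE2^2), i.e. independent measures.
    """
    pe_diff = sqrt(pe_first ** 2 + pe_second ** 2)
    return (r_first - r_second) / pe_diff


# Bosse (Evansville): categorized .796 (PE .044), uncategorized .632 (PE .064).
ratio = difference_over_pe(0.796, 0.044, 0.632, 0.064)

# Criterion stated in the text: significant only if the ratio exceeds four.
is_significant = abs(ratio) > 4

print(round(ratio, 2), is_significant)
```

Under these assumptions the ratio is a little over two, close in magnitude to the value entered for Bosse in Table V and well short of the criterion of four.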


TABLE V. SIGNIFICANCE OF DIFFERENCES BETWEEN CATE- 
GORIZED AND UNCATEGORIZED TEST SCORES 
School                     Value     School                      Value
Lincoln: Grade 7........             Bosse.....................  -2.12
Central (Evansville)....             College Sophomores........    .50
Reitz (Evansville)......             College Freshmen..........    .74
Lincoln: Grade 10.......    .15

Applying the criterion of four as a value for significance, we find
that no one of the differences can be considered as statistically significant. 
In only two cases (Wadesville and Reitz) was there even an approxima- 
tion to significant differences revealed between the two forms. 

The uniformly high correlations for the three college classes seem 
noteworthy. If the Spearman-Brown prophecy formula be applied, the 
reliability of the lengthened test would be higher than the highest of 
the correlations between the two forms. This holds true for even the 
lowest of the chance halves correlations. But this is to be expected, for 
the chance halves reliabilities are higher than those of either of the 
other two methods of determining reliability. 
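
The comparison can be checked with the Spearman-Brown formula for a test lengthened k times, r_k = k r / (1 + (k - 1) r). The sketch below uses the lowest college chance-halves correlation (.844) and the highest correlation between the two forms (.909), both taken from the figures reported in this study. The function name is an illustrative assumption.

```python
def spearman_brown(r_half, length_factor=2):
    """Predicted reliability of a test lengthened by length_factor."""
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)


lowest_college_half_r = 0.844    # lowest college chance-halves correlation
highest_between_forms_r = 0.909  # highest correlation between the two forms

stepped_up = spearman_brown(lowest_college_half_r)
exceeds_between_forms = stepped_up > highest_between_forms_r

print(round(stepped_up, 3), exceeds_between_forms)
```

With a doubled length the lowest college coefficient rises to about .915, which is above .909, in agreement with the statement made above for the college classes.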


Summary and Conclusions 


We might review briefly the chief points of this study as planned 
and carried out and the essential facts that seem to emerge from the 
findings as reported. From a consideration of the data gathered and 
analyzed, the following summary and conclusions seem warranted. 

1. The Odell Test of Mental Capacity was mimeographed in two 
forms—the original uncategorized and a categorized form. The test was 
given to a number of high school and college classes, totaling 354 stu- 
dents, half of whom were given one form of the test first and half the 
other form first. 

2. For the high school groups the means of the second form given 
were always higher than the means of the first form, although the 
differences between the means exceeded the sigma of the first test in 
only one instance. The differences in the means were in no case really 
significant. The practice effect was obvious in most groups and classes, 
but this fact is not new or different from what we find even in the case 
of testing with equivalent or duplicate forms of a test. 

3. The differences between the means were considerably larger 
when the categorized form was given first. Just why this was true we 
have no means of knowing. It seems reasonable, however, that the 
setting among elements similar to itself was conducive to greater “im- 
pression effect” than when it was among unlike elements which made it 
necessary for mental activities to shift more frequently from one type of 
item and activity to another. 

4. For the college classes the scores were always higher on the 
second form. None of the differences, however, were significant ones. 
The practice effect was apparent here as with the high school groups— 
and probably for the same undeterminable reason or cause. Some mem- 
bers of the two classes who were given the tests on successive weeks 
indicated recognition of the materials by asking whether the second 
test were not the same test in a different form. 

5. The correlations between the two forms of the test range from 
.645 to .909 with no discernible tendency for the high school groups to 
show higher in these relationship measures than the college groups, nor 
for the groups from the smaller high schools to be appreciably different 
in this respect from those from the larger high schools. There is no 
evidence here that either form—categorized or uncategorized—has 
especial merit over the other form.

6. The chance-halves correlations for high school classes range
from .566 to .855 and for the college groups from .844 to .893. These 
correlations, especially with college groups, point even more strongly to 
the fact that one form has no particular merit over the other so far as 
measures of reliability would indicate. There are no significant dif- 
ferences between the means of the two correlations for any class. 

7. So far as statistical analysis of our data goes, there is no 
significant difference in reliability between the two forms of the Odell 
Test. This is shown by the study of the differences between the means 
of the two forms, the correlations between the two forms, and the cor- 
relations between the chance halves of both forms. 

8. As the character of the high school groups varied considerably, 
it is probable that the lack of significant differences is a true lack for 
high school students. And to some degree the same might be said of the 
college groups inasmuch as three different classes, freshmen to juniors, 
were represented. 


Suggestions for Further Study 


1. It would be interesting to extend this study to include the test- 
ing of grade groups from the eighth grade down to the third grade. It 
may be that the lower grade or age groups would not succeed as well 
on the uncategorized form as they would on the categorized form. This 
is yet to be determined by careful and objective experimental study. 

2. It would also be interesting to disassemble a categorized test, such
as the National Intelligence Test or the Terman Group Test of Mental 
Ability, and work it into an uncategorized form and give it as the Odell 
Test was given to see what the results might be. It is possible that 
the findings would be somewhat different from those reported in this 
study. 

3. A more extensive study should be made of the relative validity 
of the two forms. Truer measures of mental ability than a college apti- 
tude test score should be obtained as a criterion with which to com- 
pare the scores of the categorized and uncategorized forms. This should 
be done for age and grade groups throughout the range for which these 
two test forms are constructed and used. It is only by slow accumulation 
of fact that we can determine whether the instruments we are con- 
structing and applying are suited to our purposes and will assist us in 
reaching the goals that we are striving to attain in our educational 
endeavors. 


Some Assumptions Underlying the
Educational Measurement Movement 


CHARLES W. ODELL, Associate Professor of Education, 
University of Illinois 


IN dealing with the topic, “Some Assumptions Underlying the Educational
Measurement Movement,” I wish to state and examine critically
certain explicit assumptions which have been set forth as constituting a
sort of creed of the educational measurers, and also others which are
usually implied rather than expressed and upon which many of the
procedures actually employed are based.

I have never heard of any universally accepted, complete set of 
assumptions underlying this movement. In the first chapter of his 
volume, How to Measure in Education, which appeared in 1922, W. A. 
McCall presented a group of fourteen theses, as he called them, which 
represented his idea of the bases upon which the movement rested. 
Inasmuch as he had been a student under Thorndike, later a colleague, 
and was in close touch with the Teachers College group of leaders in 
this field, who were unquestionably the outstanding group in educational 
measurements, the assumption that his theses represented fairly well 
the general views of that group appears justified. This belief may be 
supported by the expressions, in writing and public utterances, of the 
group. Therefore these fourteen theses may justly be regarded as the 
nearest approach to the fundamental creed of workers in this field that 
had appeared up to that time. 

Since that time nothing has appeared that was really comparable 
until in McCall’s recent book, entitled Measurement and dated 1939, he 
again began with a similar set, which he presents under the title, “A 
Philosophy of Measurement.” There are seventeen theses instead of 
fourteen, many of them the same as some of the original ones. It is 
probably true that, with the growth of the movement, greater diversity 
of opinion and belief has developed and therefore these seventeen theses 
do not approximate general acceptance so nearly as did the first set. 
But within the scope of my knowledge there is no other formulation 
that even approaches doing so. I wish, therefore, to examine briefly 
this recent attempt to provide a theoretical basis for the educational 
measurement movement. 


Thesis 1. The Ultimate Test of All Things Is the Happiness They 
Yield. 


McCall asserts that the greatest good, the ultimate purpose of 
society, is to produce happiness, which is not for the individual alone but 
for all. Neither is it mere pleasure, as the term is often used, nor is it
something for the present only, as it may be largely for the future. He
does not attempt to connect educational measurement directly with
happiness except as he states that a test of happiness is necessary to the
satisfactory solution of educational problems. I suppose that practically
all of us are willing to accept happiness, in this broad interpretation,
as the supreme goal of humanity, and perhaps accept also the need for 
means of testing it, but to use the language of the street, “Where does 
it get us?” in our measurement. On one side of the family, at least, I 
inherit the right to live to a ripe old age, but I have no hope of seeing 
any real progress in testing happiness in my time. 


Thesis 2. It Is Proper for Most Tests to Measure Secondary Traits. 


Under this, McCall points out that, since happiness is brought about
in large measure by means of indirect procedures and activities, it is 
appropriate that these be tested. In setting them up as criteria, how- 
ever, we must be careful not to consider them in isolation from the final 
goal, for this may cause them to lead to undesirable results. Here the 
approach to measurement is closer and is in accord with the present 
tendency to discard the old viewpoint of the subject for its own sake 
rather than for that of the child. 


Thesis 3. The Alleged Conflict Between Measurement and Gestalt
Psychology Is Equivalent to the Conflict Between Secondary Criteria and
the Ultimate Criterion.


The individual’s personality and capacity is more than the sum of 
his separate characteristics and abilities, being rather the resultant of 
their interaction. Therefore a test score of an individual cannot be 
correctly interpreted as a measure of an isolated, independent char- 
acteristic, but rather as one of a phase of the reaction of the individual 
in its relationships with those on other tests. 


Thesis 4. Measurement Is Essential to the Maintenance and In- 
crease of Each Generation’s Capacity to Learn. 


McCall claims that Western population is declining in both quantity 
and quality and that, unless consciously arrested, present tendencies 
will destroy our civilization. Therefore he urges the necessity of measure- 
ment to discover why our population is being replenished from the 
lower rather than the upper ability ranks and advises action to reverse 
this tendency. Such action must be based upon the determination of who 
are in the upper ranks, hence the need for measurement. If his premise 
be granted, his conclusion seems logical, and even if the former is not 
sound, improvement requires such measurement. 


Thesis 5. Tests Perform a Vital Service to Governments. 


This is an undoubted fact, but scarcely appears to belong in this set 
of theses. 


Theses 6 and 7. Whatever Exists at All, Exists in Some Amount, 
and Anything That Exists in Amount Can Be Measured. 


These two, which were the first two of McCall’s original fourteen, 
are given together because of their close connection and because they 
are often combined. What we think of as qualitative differences can be 
expressed as quantitative concepts, more of certain characteristics 
and less of others. We are now measuring many things that only recently
were considered incapable of measurement, and there is no intrinsic
reason why we cannot, in time, measure anything. However, the
period of time involved before many things are measured with sufficient 
validity to yield useful results will undoubtedly approach eternity as a 
limit. 

Thesis 8. Measurement in Education Is in General the Same as 
Measurement in the Physical Sciences. 


The acceptance of this thesis depends largely on the interpretation 
of the phrase “in general.” Both types deal with physical manifestations. 
Moreover, whenever measurement goes beyond mere enumeration it in- 
volves features which lead to the next thesis. 


Thesis 9. All Measurements in the Physical Sciences Are Not 
Perfect. 


Here McCall seems to have understated the case, since no measure- 
ments which involve more than counting are perfectly accurate. We 
may measure time to the thousandth of a second, length to the millionth 
of an inch, but there still remains a margin of possible error. The causes 
are essentially the same as those for errors in mental measurement— 
variations in the thing being measured, in the instrument being em- 
ployed, and in the way it is used. We are now measuring many things 
in this field more validly and reliably than many in the physical sciences 
were once measured, and even more so than some are now. 


Thesis 10. Measurement Is Indispensable to the Growth of Sci- 
entific Education. 


If this is not true, education is a glaring exception. Without 
measurement, art and architecture, physics and chemistry, biology and 
medicine, mechanics and engineering, telegraphy and telephony, radio 
and television, travel and transportation, agriculture and forestry, eco- 
nomics and sociology, indeed all the material phases of our present 
civilization and culture, would never have developed, and without meas- 
urement our society would quickly be reduced to a state little if any 
above barbarism. 


Thesis 11. Measurement in Education Is Broader Than Educational 
Tests. 


This is self-evident. We are measuring, and need to measure even 
more, numerous features connected with our schools and their activities 
other than those for which tests, in the narrow and technical sense, 
are appropriate. Score cards, scales, rating devices, questionnaires, and 
many other instruments are necessary. 


Thesis 12. To the Extent That the Pupil’s Initial Abilities or 
Capabilities are Unmeasurable a Knowledge of Him Is Impossible. 


Unless we know with what we are working and its significant char- 
acteristics we cannot hope to accomplish anything of value. Here also 
the field of education is no different from any other. Unless we know 
our materials we work blindly, by chance rather than by plan.


Thesis 13. To the Extent That Any Goal of Education Is Intangible
It Is Worthless.


In order to make use of a goal as a means to increase the efficiency 
of the educational, or any other, process, we need to know what it is 
worth, what and where it is, and whether or not progress is being made 
toward it. Each of these requires measurement. 


Thesis 14. The Worth of the Methods and Materials of Instruction 
Is Unknown Until Their Effect Is Measured. 


This is closely connected with the last two. We must know where 
the pupil was at first, also how far toward the goal he has advanced, 
in order to evaluate whatever influences have been brought to bear 
upon him. In this also we are but accepting the conditions that obtain in 
the physical realm. 


Thesis 15. Measurement of Achievement Should Precede Super- 
vision of Teaching Method. 


I think I do not differ materially with McCall concerning his point 
here, but I should state it otherwise by substituting “accompany” for 
“precede.” In any event, measurement is essential to supervision. 


Thesis 16. Measurement Is No Recent Educational Fad. 


Although we often speak of the recent educational measurement 
movement or of the scientific movement in education as something 
dating back only a few years, tests and other measurements have doubt- 
less existed ever since education began. We are merely refining them 
and enlarging their scope. 


Thesis 17. Teachers Should Codperate in All Testing and Should 
Be Allowed to Administer and Score Intelligence and Educational Tests 
and Interpret Results. 


In his use of “all,” McCall appears to be somewhat too inclusive, 
but I certainly agree that, as a general rule, teachers should be codperat- 
ing participants in educational measurement. Unless they are, many 
of the highest possible values will be attained in small measure. 

As should be evident by this time, I believe that McCall’s theses are, 
on the whole, valid and constitute an approximately satisfactory basis 
for the educational measurement movement. To most of them there 
seem to be no valid basic objections, since both fact and logic sustain 
them. Now, having accepted with slight modification this set of quite 
general assumptions, I wish to proceed to examine some others which 
I shall not deal with so favorably. Most of those which I shall mention 
are implicit in measurement procedures and interpretations rather than 
explicitly stated. Many of the leaders in the movement have recognized 
the elements of uncertainty in these assumptions and I would not at all 
claim the virtue of originality for suggesting them. Indeed, on the other 
hand I should acknowledge my debt to R. W. Tyler, B. O. Smith, L. T. 
Hopkins, and others. You should not, however, hold any of the men 
just named or any others responsible for the exact points of view I shall 
present.

Before proceeding to consider these assumptions, I wish to make it
clear that I speak as a friend of and believer in educational measure- 
ments, not as an outsider who would destroy the movement and dis- 
courage its supporters. Criticism should come from within, from those 
who are interested in making measurements as helpful as possible and 
who are willing to work toward that end. Despite the lack of evidence 
to support some of the widespread assumptions which I shall mention, I 
have faith in both the present value of educational measurements and
the probability that we shall so improve them as to secure constantly 
increasing value therefrom. 

My first point is that the validity of our instruments is, in gen- 
eral, logically polar. This means that we begin with the assumption 
that learning is quantitative and that responses on a test are propor- 
tionally related to total achievement. We then construct measuring in- 
struments on that assumption, and finally justify it by the results ob- 
tained. We are not unique in employing this method, as it has very 
frequently been that of science. To the extent that the consequences 
are helpful, to that extent we may consider our assumption valid. We 
need, however, to make use of other means of corroborating validity, if 
any are available, but in the meanwhile we should proceed to employ 
instruments validated as just suggested rather than wait for those con- 
cerning whose validity we can be absolutely sure. In so doing, however, 
we must endeavor to keep the construction, application, and interpreta- 
tion of the results of our instruments consistent with the best psycho- 
logical and biological principles, and constantly test the conclusions and 
relationships inferred from them by the consequences which follow 
their application. 

Many of our instruments, especially those designed to measure 
achievement in the school subjects, are at fault with respect to the first 
of the two points just stated. I refer especially to the fact that they 
are based upon a conception of education that was dominant about the 
time that standard tests began to appear. This involved the belief that 
the major purpose of education is to transmit a social heritage organ- 
ized in subjects, in which uniform learning of a body of minimum 
essentials should be required of all. I do not believe modern educational 
theory has totally discarded the idea of minimum essentials, nor that 
it should, but it has certainly accorded such importance to the making 
of provisions for individual differences as is inconsistent with the previ- 
ously held concept just mentioned. Unfortunately our tests have not 
kept up with the change, doubtless because it is easier to measure out- 
comes in agreement with the former theory. 

Despite the warnings and examples given by various workers in 
this field, we are still far too prone to assume that a test really measures 
what its title or the claims of its author and publisher assert. This may 
seem rather elementary, but I am impressed with its truth over and over
again, not merely by the assumptions of my undergraduate students, 
but by those of experienced teachers, supervisors, and administrators 
who have had graduate work, and even by those of many who contribute 
to our educational literature. I still encounter such persons who have 
no doubt that a test composed of such elements as 3 + 4, 8 + 1, 6 + 2,
5 + 5, 4 + 7, and so on, with a time limit so short that pupils do not
finish, measures speed and accuracy of addition for all pupils, although 
quite a number of years ago it was shown that a considerable per cent 
of pupils who are good in addition can add single digits more rapidly 
than they can write the answers. I still encounter such persons who 
refer to the ordinary type of group intelligence test as measuring almost 
entirely native capacity, although the great effect of training upon 
scores thereon has been demonstrated. 

Perhaps the most common error of interpretation of the sort just 
referred to is the assumption that a particular test yields a score that 
is a valid measure of ability in the whole subject or course of which it 
tests a part. We frequently speak of ability in reading, in language, in 
French, in physics, in any school subject, as if it were something that 
could be measured by a single test of an hour, or possibly two or three
hours, in length. Practically every subject involves a number of major 
objectives, which are subdivided into a host of minor ones. Even allow- 
ing, as is of course necessary, that they are measured by the process 
of sampling, the number of exercises and elements which must be in- 
cluded in a test that is to measure the whole course is so large as to 
render its completion within a reasonable time limit entirely impossible. 
Moreover, even if it were possible for testees to complete such a test, 
there would be no known means of weighting its various parts so as to 
insure that the total score was a valid measure of the desired whole. 
This is all the more true because experimental evidence shows that the 
relationships among pupils’ abilities in the various phases of a subject, 
although usually positive, are far from perfect. In fact they are often 
so far from perfect that the coefficients of correlation between knowl- 
edge of chemical formulae and skill in manipulating laboratory ap- 
paratus, for example, or between ability to give common grammatical 
rules and to write forceful English, are often well below .50 and some- 
times quite close to zero. Furthermore, the measurement of what is 
frequently a mere fragment of subject-matter in isolation is by no 
means certain to yield a result that agrees closely with that obtained by 
measuring it in place with the other elements with which it is associated 
in actual use. The situation is well illustrated by the old story you 
have all heard of the boy kept in after school and directed to write 
“I have gone” fifty times so as to impress the correct form on him. 
The teacher left the room before he had completed the task, and when 
she returned found him missing, but on her desk the fifty copies cor- 
rectly done and the added words “I have done what you told me to do, 
so I have went home.” 

From this, two corollaries suggest themselves to me. One is that 
when we are making use of what we often call general survey tests— 
that is, tests which cover a wide range of material without concentrating 
on any portion thereof—we should not fall into the habit of considering 
our results as measures of the whole field covered. They are rather a 
sampling thereof which is probably proportional to, and therefore an 
index of, the desired outcomes of the subject or group of subjects at 
issue, but no more than that.

The second is that most of our tests should not attempt to cover
a very wide range of subject-matter. If they are primarily for the pur- 
pose of determining marks, they should do so; that is, they should be 
of the general survey type just referred to. It is, however, regrettable 
if the determining of marks becomes the major objective of testing. 
Rather than this, it should be to furnish such information about pupils 
and the effectiveness of the conditions with which we surround them as 
will enable us to provide the optimum assistance in their development. 
For this purpose we are most in need of valid measures of particular 
phases of their reactions or, in other words, of the information yielded 
by diagnostic tests. Such tests need not test narrow phases so much as 
phases that are identified definitely by the measures obtained. Only by 
the use of such instruments can we secure an approximately exact 
knowledge of what changes are taking place in our pupils and what 
consequences ensue from our procedures and materials. 

Another assumption often made concerning validity is that there is 
a satisfactory criterion with which a test can be compared, usually by 
correlating scores on the two, to determine its validity. In many cases 
coefficients of correlation between scores on the test in question and 
some other tests or data are given and denominated coefficients of 
validity. In most instances of this sort the term is at best only partially 
justified. In the early days of the testing movement, according to some 
of its critics, achievement tests were thus validated by correlating the 
scores thereon with teachers’ marks, and then later, after the tests 
had been more or less accepted, the teachers’ marks were declared to 
be of little worth as measures and the tests were declared to be of 
high merit. There was often too much truth in this accusation for us to 
feel complacent. It is true that, if two or more tests made independently 
by as many competent individuals or groups of individuals and intended 
to measure the same thing yield scores that correlate highly, this fact 
may be offered as presumptive evidence that both do measure it, but 
it is by no means proof that they do. The nearest approach to a satis- 
factory criterion for most tests is probably a very extensive series of 
tests over the same content, but this has been employed in only a 
very small minority of cases, in so far as present standard tests are 
concerned. 

In practically all cases the tests we administer measure indirectly. 
In other words, they secure, for purposes of evaluation, not the actual 
outcomes which are the desired results of education but only some 
supposed indices thereof. We can secure samples of the handwriting 
of pupils produced under essentially normal conditions of the functioning 
of that ability, and then rate them by means of a handwriting scale. 
We can, at the cost of considerable effort and time, observe, record, and 
evaluate the spoken English of pupils as they use it in ordinary life 
situations. We can, in some of the school subjects, have pupils produce 
objects such as drawings, articles of clothing, pieces of furniture, and 
so forth under conditions that may approximate real out-of-school 
situations nearly enough that we need not worry about the difference. 
In the vast majority of cases, however, we either cannot utilize such 
methods or find them so difficult that we do not, and instead we rely 
upon some form of verbal response which we accept as evidence of the 
ability we desire to measure. The results vary in validity for the pur- 
poses desired. The ability to select the best one of five suggested 
synonyms or definitions for a word does not guarantee the ability to give 
a correct definition of the same word without assistance, nor does either 
insure that the correct meaning will be attached to the word when it 
is encountered in context or that it will be correctly employed in speech 
and writing. Likewise, the ability to state and explain the principle of 
the lever does not insure that the pupil who possesses it can use such 
an instrument in the most efficient manner when he is faced with a 
practical situation demanding it. It is, of course, often necessary to 
resort to such indirect methods if any measurement is to be done, if 
tests are to be practicable for more than experimental application. 
However, such measures should be validated, whenever possible, by 
comparison with the actual abilities of which they are intended to be the 
indices. 

Not long ago I referred to the necessity of sampling. We com- 
monly assume that a test provides an adequate and representative 
sampling of pupil responses to all the situations which are involved 
in the given objectives. In order to do this the test must sample satis- 
factorily both the situations referred to and the pupils’ responses 
thereto. The sampling of pupils’ responses, especially, is very often 
inadequately done and overlooked in considering the validity and relia- 
bility of the test. For example, a test containing all the possible com- 
binations of the single digits, arranged for addition, can be easily con- 
structed and provides a complete and therefore satisfactory sampling 
of the situations. The pupil’s responses are another matter: unless a 
pupil knows all the sums quite well it is improbable that he will 
respond to each combination on a repetition of the test in exactly the 
same way as he did on the first application. Some of the 
problems which he had correct on the first paper will probably be in- 
correct on the second, and vice versa. Since we cannot, except in rela- 
tively rare instances, make and give tests long enough to insure perfect, 
or even approximately perfect, sampling with regard to these two fac- 
tors, we should, by statistical methods, determine the degree of reliabil- 
ity of the sampling contained in a test and interpret the results secured 
in the light of such information. Moreover, we should not charge the 
variability in pupil response to the unreliability of the test any more 
than we blame the meter stick or yardstick for the fact that a person’s 
height is not always found to be just the same when measured with it. 
This consideration holds with especial force when the objective of 
measurement is something that is largely concerned with attitudes, in- 
terests, or ideals rather than something more completely intellectual, 
since pupils’ reactions are not as stable in this field. 
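
The dependence of reliability upon the length of a test may be 
illustrated by the Spearman-Brown formula. That formula is not named 
in the discussion above and is brought in here only as an assumption; 
it supposes that the added items are comparable to those already in 
the test. The function name and the starting reliability of 0.60 are 
hypothetical. 

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test made length_factor times as long.

    Assumption: the added material is comparable to the original test
    (Spearman-Brown formula). This is an illustrative sketch only.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


start = 0.60  # hypothetical reliability of the original test
for factor in (1, 2, 4, 8):
    # Reliability rises with length but never reaches 1.00.
    print(factor, round(spearman_brown(start, factor), 3))
```

Under these assumptions each doubling of length adds less than the 
one before it, which is why perfect sampling is out of practical reach. 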

Certain other assumptions are connected with scoring tests. We 
commonly assume that the method of scoring employed with a particular 
test is such that it will give evaluations of pupils’ responses that are 
appropriate and valid for the purposes of the test. Only rarely are 
other methods of scoring tried and the best selected. If the answers are 
not objective and of such a nature as can be scored by the use of a 
key, the ratings given by different equally competent persons will not 
be just the same. It becomes necessary to establish some criterion 
by which their validity can be judged. There is ordinarily none that 
is entirely satisfactory. In most cases the best that can be done is to 
compare them with the average or composite ratings of a large num- 
ber of scorers. 

Even if there is a scoring key by which all possible answers can 
be classed as either right or wrong, and by which all competent scorers 
may arrive at the same score, except for errors of carelessness, sub- 
jectivity is not totally eliminated, as we often assume. There may be, 
among those qualified to express their opinions, differences as to what 
some of the correct answers are, differences which have been neglected 
in preparing the scoring key. Even if this is not true, the weighting 
or number of points to give each element has been subjectively de- 
termined. On most standard and other short-answer tests, one point is 
given on each element. The usual justification for this method is that 
it greatly facilitates scoring and that, if the test is fairly long, the 
correlation between scores so determined and those based upon a system 
of unequal weights is so high that it makes no material difference which 
one is used. This is, in general, supported by the evidence, but particular 
cases can be found in which it does not hold. Moreover, if some system 
of unequal weighting is employed, it has as its basis subjective opinion, 
even though there may be elaborate statistical procedures employed in 
its final determination. There is, for example, far from universal agree- 
ment on the relative significance of incorrect answers as compared with 
failures to give any at all, also on whether difficulty, frequency of use, 
importance as judged by experts (here another subjective factor is in- 
troduced), or some other criterion should determine relative weights. 
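
The claim that one-point and unequally weighted scores correlate 
highly can be checked on any set of papers. The following is a minimal 
sketch; the six pupils, the eight items, and the weights are all 
hypothetical and are chosen only to show the computation. 

```python
from math import sqrt


def pearson(x, y):
    """Product-moment coefficient of correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def score(row, weights):
    """Sum of the weights of the items answered correctly."""
    return sum(r * w for r, w in zip(row, weights))


# Hypothetical papers: one row per pupil, 1 = right, 0 = wrong.
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
]
unit_weights = [1] * 8
unequal_weights = [1, 1, 2, 2, 3, 3, 4, 4]  # hypothetical, subjectively set

unit_scores = [score(row, unit_weights) for row in responses]
weighted_scores = [score(row, unequal_weights) for row in responses]
r = pearson(unit_scores, weighted_scores)
print(round(r, 3))
```

A high coefficient from such a check shows only that the two systems 
rank pupils alike; it does not show that either set of weights is the 
right one. 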

We have often gone to great pains to secure scoring systems which 
made use of standard units, that is, units which may be reported to 
others and convey the same meaning to those to whom they are re- 
ported as they have conveyed to those who have reported them. In 
practically all cases they are based upon as yet unproven assumptions. 
For example, the approved method of determining units of difficulty is 
to translate per cents of correct or incorrect responses into standard or 
median deviation units on the normal curve, after which equal dis- 
tances in deviation units are considered equal with regard to the char- 
acteristic or ability being measured. This procedure has been very con- 
venient and has undoubtedly served to yield valuable information, but 
it cannot be said to be completely proven. As a second example, a 
difference of one year of mental age at one place on the scale has fre- 
quently been dealt with as if it were equivalent to that of one year at 
another point, an interpretation which is not justified in terms of other 
possible measures of the same thing. Thus in general our standard units 
are scarcely fully standard, but we often employ them as if they were. 
In this connection we should note that many of the units used in the 
physical sciences are also subject to this same qualification, although 
in ordinary use we do not think of such a limitation. 
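
The translation of per cents into deviation units may be sketched as 
follows. The sketch assumes, as the method itself does, that the 
ability is normally distributed, and it reads "median deviation" as 
the probable error, 0.6745 of a standard deviation. The function names 
are hypothetical. 

```python
from statistics import NormalDist

PROBABLE_ERROR = 0.6745  # one probable error, in standard deviation units


def difficulty_in_sigma(proportion_correct: float) -> float:
    """Baseline position of an item, in standard deviation units.

    Assumption: ability is normally distributed, so the proportion
    failing an item is the area below the item's difficulty.
    """
    if not 0.0 < proportion_correct < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - proportion_correct)


def difficulty_in_pe(proportion_correct: float) -> float:
    """The same position expressed in probable-error units."""
    return difficulty_in_sigma(proportion_correct) / PROBABLE_ERROR


for p in (0.90, 0.75, 0.50, 0.25, 0.10):
    print(p, round(difficulty_in_sigma(p), 2), round(difficulty_in_pe(p), 2))
```

The distances so obtained are equal only by definition. Nothing in the 
computation tests whether a step from 50 to 25 per cent correct is the 
same amount of ability as any other step of equal length on the 
baseline. 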

I should like to consider further the method of determining score 
values by the normal curve because it has been employed so widely, 
both in the construction of scaled tests and in that of quality or merit 
scales. The usual method of determining the merit of the specimens on 
a quality scale is to have them compared with one another by a num- 
ber of judges, the results expressed as per cents of better and worse 
judgments for each pair of specimens, these per cents assumed to cor- 
respond to areas under the normal curve and the linear or baseline dis- 
tances corresponding to the areas found and taken as the values or 
ratings desired. This procedure assumes that the judgments of dif- 
ferences may be treated as errors of observation and that equal base- 
line units represent equal units of the quality which it is desired to 
measure. The experiments of Cattell and Fullerton which gave rise 
to the method being discussed were made with such phenomena that 
differences in the things measured were evidently purely quantitative. 
For example, in one of their experiments the variable involved was the 
linear distance between two uprights. No possible question of quality 
was concerned. This same situation does not hold, however, in the con- 
struction of educational scales. Even though we accept the dictum that 
quality is just greater or lesser quantity, we are not justified in regard- 
ing the two situations as the same. The merit rating given a specimen 
of handwriting, a drawing, a composition, or some other similar pupil 
product is not based simply upon the recognition of more or less of a 
single recognizable characteristic, such as length in the experiment just 
mentioned, but upon the judgment of an entity which is the complex of 
many factors each of which may be present in varying amounts. There- 
fore the judgments are not all based upon the same elements and, ac- 
cordingly, they are not purely quantitative. 

The second of the two assumptions upon which the method under 
consideration rests, that equal portions of the baseline of the normal 
curve represent equal quantities of the characteristic to which the area 
corresponds, also is not proven. The assumed equality is merely one of 
definition, not of known fact. Its use has served to convey an illusion 
of scientific precision and validity not justified by the evidence so far 
made available. 
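
To make the two assumptions concrete, the scaling of specimens from 
paired judgments may be sketched as below. This is a simplified 
version, not the exact procedure of any published scale: each per cent 
of "better" judgments is turned into a baseline distance on the normal 
curve, and the distances for a specimen are averaged. The specimens A, 
B, and C and the proportions are hypothetical. 

```python
from statistics import NormalDist, mean


def scale_values(specimens, prop_better):
    """Baseline values from proportions of 'better' judgments.

    prop_better[(x, y)] is the proportion of judges rating x better
    than y; each pair need be entered in one order only. Assumption:
    proportions correspond to areas under the normal curve.
    """
    to_distance = NormalDist().inv_cdf
    raw = {}
    for s in specimens:
        distances = []
        for other in specimens:
            if other == s:
                continue
            if (s, other) in prop_better:
                p = prop_better[(s, other)]
            else:
                p = 1.0 - prop_better[(other, s)]
            distances.append(to_distance(p))
        raw[s] = mean(distances)
    lowest = min(raw.values())
    # Place the poorest specimen at zero; the zero point is arbitrary.
    return {s: value - lowest for s, value in raw.items()}


# Hypothetical judgments: A was rated better than B by 25 per cent, etc.
judgments = {("A", "B"): 0.25, ("A", "C"): 0.10, ("B", "C"): 0.30}
values = scale_values(["A", "B", "C"], judgments)
print({s: round(v, 2) for s, v in values.items()})
```

The arithmetic runs equally well whether the judges attended to one 
characteristic or to a dozen, which is precisely why its output cannot 
by itself establish that the units are equal. 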

In much of the literature on educational measurements, methods of 
teaching, and other related phases of education, we find great emphasis 
on the position that the one chief objective of measurement should be 
the development of the child, and that this requires the improvement 
of instruction. With this I am in hearty accord, but with one of its 
frequent consequences I wish to disagree. I refer to the use of tests to 
attempt to measure changes so small in comparison with the units em- 
ployed on the tests as to render such efforts futile. The scoring units of 
many of our tests are so coarse that the normal gain from one year to 
the next, or from one grade to the next, is represented by only a few 
points increase in score. Moreover, the probable error of that score may 
be so large as to equal a considerable fraction of the gain from year to 
year. Despite this condition, many of those interested in utilizing test 
results for the improvement of instruction have conducted experiments 
and drawn conclusions based on such short periods of time, and such 
small differences of scores have resulted, that little confidence should 
be placed in their findings. The situation is much the same as if some- 
one were trying to measure the gain in time of an expert sprinter by a 
watch which measured nothing smaller than a second, or the gain in 
height from month to month with a ruler containing no divisions smaller 
than an inch. The units employed must be fine enough to yield reliable 
measures and reliable differences between measures for the periods under 
consideration. In the cases of many of the characteristics we desire to 
measure, this requirement means that tests must either be very long, 
or else they must be designed to cover only a quite limited range of 
ability therein. 
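
The comparison of a gain with its probable error may be sketched as 
follows. The formulas assumed are the customary ones: the probable 
error of a score is 0.6745 times the standard deviation times the 
square root of one minus the reliability, and the errors of the two 
testings are taken as independent. The standard deviation, reliability, 
and gain are hypothetical figures. 

```python
from math import sqrt

PE_FACTOR = 0.6745  # probable error as a fraction of the standard error


def probable_error_of_score(sd: float, reliability: float) -> float:
    """Probable error of a single pupil's score on the test."""
    return PE_FACTOR * sd * sqrt(1.0 - reliability)


def probable_error_of_difference(pe_first: float, pe_second: float) -> float:
    """Probable error of a gain, assuming independent errors."""
    return sqrt(pe_first ** 2 + pe_second ** 2)


sd = 10.0           # hypothetical standard deviation of scores
reliability = 0.91  # hypothetical reliability coefficient
gain = 4.0          # hypothetical gain from one year to the next

pe = probable_error_of_score(sd, reliability)
pe_gain = probable_error_of_difference(pe, pe)
print(round(pe, 2), round(pe_gain, 2), round(gain / pe_gain, 2))
```

With these figures a year's gain is less than one and one-half times 
its own probable error, so a gain observed over a few weeks would be 
lost in the error of measurement. 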

Within the last few years those of us who have read much of the 
literature of this field or have heard many reports of research have 
heard a great deal about experimental coefficients, critical ratios, errors 
of differences, and other similar statistical devices. I certainly have no 
quarrel with their use; indeed, they should have been employed in con- 
nection with the interpretation of many data where they were not. 
This emphasis upon them, however, has had one unfortunate result. 
Too often statistical significance has been taken as settling the question 
at issue, and social significance has been overlooked. An experiment 
might show, for example, that one teaching procedure in handwriting 
produced eighth-grade handwriting of average merit, according to the 
Ayres Scale, of 80, whereas another produced that of 70 only, and the 
control of experimental conditions and number of pupils involved might 
be such that the statistical significance of the difference would be very high, 
practically infinity to one. In all too many instances a question has been 
considered as satisfactorily answered when such findings were secured, 
without regard to other factors that should be considered. One of prime 
importance is that of whether writing of a quality better than 70 has 
any, or much, social worth for most individuals. Another is that of 
whether, if the method involved the use of more time, more teacher 
effort, more material resources of any kind, the same expenditure might 
not produce more valuable results if devoted to some other subject of 
instruction or to some other desired outcome of education. Still further, 
there may be other effects upon the pupils concerned than the gain in 
quality of handwriting, and these may be undesirable, so that the 
balance of good over bad results needs to be determined before the 
experiment is finally appraised. 
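
The handwriting example can be put in figures. The averages of 80 and 
70 are those given above; the standard deviation of 15 and the 400 
pupils in each group are hypothetical, and the critical ratio is 
computed in the customary way as the difference divided by its 
standard error. 

```python
from math import sqrt


def critical_ratio(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Difference between two averages divided by its standard error."""
    se_difference = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / se_difference


# Averages from the example in the text; SDs and numbers are hypothetical.
cr = critical_ratio(80.0, 70.0, 15.0, 15.0, 400, 400)
print(round(cr, 2))
```

A ratio of this size leaves no reasonable doubt that the difference is 
real, yet the computation contains no term for the worth of writing 
better than 70, for the cost of the method, or for its other effects 
upon the pupils. 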

The reverse of the situation may also be true. I refer to a case 
in which the difference between two procedures is not statistically 
significant to the point of reasonable certainty, but only to that of 
moderate probability. If the difference is in some characteristic which 
is very much desired and to obtain which no other better procedure is 
known, it may be distinctly worth while, from the social viewpoint, to 
adopt the preferred procedure even though there is much doubt as to 
whether it is sure to yield better results. In other words, if a result is 
very valuable, we should try the best available method to obtain it 
even though the promise of success is not very great. 

Overlapping both of these types of cases, yet not just the same 
as either, are those in which the cost, in money, time, or something 
else, should be considered in deciding whether the result is worth 
securing or not. For example, is it worth an additional expenditure 
of five dollars per pupil per year to employ an additional teacher who 
will devote his time to helping pupils learn to study more effectively if 
thereby they improve their abilities in certain subjects by determined 
and statistically significant amounts? 

Such questions as these cannot be answered by statistical or re- 
search methods, although such procedures may be of assistance in the 
matter. Rather, they demand critical thinking, based upon a com- 
prehensive philosophy of education, indeed of society in general. 

In closing, let me remind you of what I said at the beginning, that 
I do not offer these criticisms as one who would destroy the educational 
measurement movement, but as one who would promote and strengthen 
it by recognizing its weaknesses, claiming for it no more than its ac- 
complishments justify, and seeking so to advance its technics and ap- 
plications as to eliminate these weaknesses and make further advances 
as soon as possible. I am not at all pessimistic about its future, but 
confidently anticipate gradual and never-ending progress, resulting in 
increasing usefulness and greater contributions to educational theory 
and practice. 